Abstract

We present a general regularization-based framework for multi-task learning (MTL),
in which the similarity between tasks can be learned or refined using $\ell_p$-norm multiple
kernel learning (MKL). Based on this very general formulation (including a general loss
function), we derive the corresponding dual formulation using Fenchel duality applied
to Hermitian matrices. We show that numerous established MTL methods can be derived
as special cases from both the primal and the dual of our formulation. Furthermore,
we derive a modern dual-coordinate descent optimization strategy for the hinge-loss
variant of our formulation and provide convergence bounds for our algorithm. As a
special case, we implement in C++ a fast LibLinear-style solver for $\ell_p$-norm MKL. In
the experimental section, we analyze various aspects of our algorithm, such as predictive
performance and the ability to reconstruct task relationships, on biologically inspired
synthetic data, where we have full control over the underlying ground truth. We also
experiment on a new dataset from the domain of computational biology that we collected
for the purpose of this paper. It concerns the prediction of transcription start
sites (TSS) over nine organisms, which is a crucial task in gene finding. Our solvers,
including all discussed special cases, are made available as open-source software as part
of the SHOGUN machine learning toolbox (available at http://shogun.ml).


1 Introduction 

One of the key challenges in computational biology is to build effective and efficient
statistical models that learn from data to predict, analyze, and ultimately understand biological
systems. Regardless of the problem at hand, however, be it the recognition of sequence
signals such as splice sites, the prediction of protein-protein interactions, or the modeling of
metabolic networks, we frequently have access to data sets for multiple organisms, tissues
or cell-lines. Can we develop methods that optimally combine such multi-domain data?

While the field of Transfer or Multitask Learning has enjoyed growing interest in the
Machine Learning community in recent years, it can be traced back to ideas from the mid-1990s.
During that time, Thrun (1996) asked the provocative question "Is Learning the n-th Thing
any Easier Than Learning the First?", effectively laying the ground for the field of Transfer
Learning. This work was motivated by findings in human psychology, where humans were
found to be capable of learning based on as few as a single example (Ahn and Brewer,
1993). The key insight was that humans build upon previously learned related concepts
when learning new tasks, something Thrun (1996) calls lifelong learning. Around the same
time, Caruana (1993, 1997) coined the term Multitask Learning. Rather than formalizing
the idea of learning a sequence of tasks, Caruana proposed machinery to learn multiple related
tasks in parallel.

While most of the early work on Multitask Learning was carried out in the context 
of learning a shared representation for neural networks (Caruana, 1997; Baxter, 2000), 
Evgeniou and Pontil (2004) adapted this concept in the context of kernel machines. At 
first, they assumed that the models of all tasks are close to each other (Evgeniou and 
Pontil, 2004) and later generalized their framework to non-uniform relations, allowing some
tasks to be coupled more strongly than others (Evgeniou et al., 2005), according to some
externally defined task structure. In recent years, there has been an increased interest in 
learning the structure potentially underlying the tasks. Ando and Zhang (2005) proposed 
a non-convex method based on Alternating Structure Optimization (ASO) for identifying 
the task structure. A convex relaxation of their approach was developed by Chen et al. 
(2009). Zhou et al. (2011) showed the equivalence between ASO and Clustered Multitask 
Learning (Jacob et al., 2008; Obozinski et al., 2010) and their convex relaxations. While the 
structure between tasks is defined by assigning tasks to clusters in the above approaches, 
Zhang and Yeung (2010) propose to learn a constrained task covariance matrix directly and
show the relationship to Multitask Feature Learning (Argyriou et al., 2007, 2008a,b; Liu
et al., 2009). Here, the basic idea is to use a LASSO-inspired (Tibshirani, 1996) $\ell_{2,1}$-norm
to identify a subset of features that is relevant to all tasks.

A challenge remains to find an adequate task similarity measure to compare the multiple
domains and tasks. While existing parameter-free approaches such as Romera-Paredes
et al. (2013) ignore biological background knowledge about the relatedness of the tasks, in
this paper we present a parametric framework for regularization-based multitask learning
that subsumes several approaches and automatically learns the task similarity from a set of
candidate measures using $\ell_p$-norm multiple kernel learning (MKL); see, for instance, Kloft
et al. (2011). We thus provide a middle ground between assuming known task relationships
and learning the entire task structure from scratch. We propose a general unifying
framework of MT-MKL, including a thorough dualization analysis using Fenchel duality,
based on which we derive an efficient linear solver that combines our general framework
with advances in linear SVM solvers, and we evaluate our approach on several datasets from
Computational Biology.

This paper is based on preliminary material shown in several conference papers and
workshop contributions (Widmer et al., 2010a,c,b, 2012; Widmer and Rätsch, 2012), which
contained preliminary aspects of the framework presented here. This version additionally
includes a unifying framework with Fenchel duality analysis, more complete derivations
and theoretical analysis, as well as a comparative study in multitask learning and genomics,
where we brought together genomic data for a wide range of biological organisms in a
multitask learning setting. This dataset will be made freely available and may serve as a
benchmark in the domain of multitask learning. Our experiments show that combining data
via multitask learning can outperform learning each task independently. In particular, we
find that it can be crucial to further refine a given task similarity measure using multitask
multiple kernel learning.

The paper is structured as follows: In Section 2 we introduce a unifying view of multitask
multiple kernel learning that covers a wide range of loss functions and regularizers. We
give a general Fenchel dual representation and a representer theorem, and show that the
formulation contains several existing formulations as special cases. In Section 3 we propose
two optimization strategies: one that can be applied out of the box with any custom set
of kernels and another one that is specifically tailored to linear kernels as well as string
kernels. Both algorithms were implemented in the SHOGUN machine learning toolbox. In
Section 4 we present results of empirical experiments on artificial data as well as a large
biological multi-organism dataset curated for the purpose of this paper.


2 A Unifying View of Regularized Multi-Task Learning 

In this section, we present a novel multi-task framework comprising many existing
formulations, allowing us to view prevalent approaches from a unifying perspective, yielding new
insights. We can also derive new learning machines as special instantiations of the general
model. Our approach is embedded into the general framework of regularization-based
supervised learning methods, where we minimize a functional

$$\mathfrak{R}(w) + C\,\mathfrak{L}(w),$$

which consists of a loss term $\mathfrak{L}(w)$ measuring the training error and a regularizer
$\mathfrak{R}(w)$ penalizing the complexity of the model $w$. The positive constant $C > 0$ controls
the trade-off of the criterion. The formulation can easily be generalized to the multi-task setting,
where we are interested in obtaining several models parametrized by $w_1, \ldots, w_T$, where $T$
is the number of tasks.

In the past, this has been achieved by employing a joint regularization term
$\mathfrak{R}(w_1, \ldots, w_T)$ that penalizes the discrepancy between the individual models (Evgeniou
et al., 2005; Agarwal et al., 2010),

$$\mathfrak{R}(w_1, \ldots, w_T) + C\,\mathfrak{L}(w_1, \ldots, w_T).$$

A common approach is, for example, to set
$\mathfrak{R}(w_1, \ldots, w_T) = \frac{1}{2}\sum_{s,t=1}^{T} q_{st}\,\|w_s - w_t\|^2$,
where $Q = (q_{st})_{1 \le s,t \le T}$ is a similarity matrix. In this paper, we develop a novel,
general framework for multi-task learning of the form

$$\min_{W,\theta}\ \mathfrak{R}_\theta(W) + C\,\mathfrak{L}(W),$$

where $W = (W_m)_{1 \le m \le M}$, $W_m = (w_{m1}, \ldots, w_{mT})$. This approach has the additional
flexibility of allowing us to incorporate multiple task similarity matrices into the learning
problem, each equipped with a weighting factor. Instead of specifying the weighting factor
a priori, we will automatically determine optimal weights from the data as part of the
learning problem. We show that the above formulation comprises many existing lines of
research in the area; this not only includes very recent lines but also seemingly different
ones. The unifying framework allows us to analyze a large variety of MTL methods jointly,
as exemplified by deriving a general dual representation of the criterion, without making
assumptions on the employed norms and losses, besides the latter being convex. This delivers
insights into connections between existing MTL formulations and, even more importantly,
can be used to derive novel MTL formulations as special cases of our framework, as done
in a later section of this paper.

2.1 Problem Setting and Notation 

Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a set of training pattern/label pairs. In multitask
learning, each training example $(x_i, y_i)$ is associated with a task $\tau(i) \in \{1, \ldots, T\}$. Furthermore,
we assume that for each $t \in \{1, \ldots, T\}$ the instances associated with task $t$ are independently
drawn from a probability distribution $P_t$ over a measurable space $\mathcal{X}_t \times \mathcal{Y}_t$. We denote
the set of indices of training points of the $t$-th task by $I_t := \{i \in \{1, \ldots, n\} : \tau(i) = t\}$.
The goal is to find, for each task $t \in \{1, \ldots, T\}$, a prediction function $f_t : \mathcal{X} \to \mathbb{R}$. In
this paper, we consider composite functions of the form

$$f_t : x \mapsto \sum_{m=1}^{M} \langle w_{mt}, \varphi_m(x) \rangle, \qquad 1 \le t \le T,$$

where $\varphi_m : \mathcal{X} \to \mathcal{H}_m$, $1 \le m \le M$, are mappings into reproducing kernel Hilbert
spaces $\mathcal{H}_1, \ldots, \mathcal{H}_M$, encoding multiple views of the multi-task learning problem via kernels
$k_m(x, \tilde{x}) = \langle \varphi_m(x), \varphi_m(\tilde{x}) \rangle$, and $W := (w_{mt})_{1 \le m \le M,\, 1 \le t \le T}$, $w_{mt} \in \mathcal{H}_m$, are parameter
vectors of the prediction function.

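For simplicity of notation, we concentrate on binary prediction, i.e., $\mathcal{Y} = \{-1, 1\}$, and
encode the loss of the prediction problem as a loss term
$\mathfrak{L}(W) := \sum_{i=1}^{n} l(y_i f_{\tau(i)}(x_i))$, where $l : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$
is a loss function, assumed to be closed convex, lower bounded and finite at 0.
To consider sophisticated couplings between the tasks, we introduce so-called task-similarity
matrices $Q_1, \ldots, Q_M \in GL_n(\mathbb{R})$ with $Q_m = (q_{mst})_{1 \le s,t \le T}$ and
$Q_m^{-1} = (q^{(-1)}_{mst})_{1 \le s,t \le T}$, and consider the regularizer
$\mathfrak{R}_\theta(W) = \frac{1}{2}\sum_{m=1}^{M} \|W_m\|^2_{Q_m} / \theta_m$ (setting $1/0 := \infty$, $0/0 := 0$) with

$$\|W_m\|^2_{Q_m} := \operatorname{tr}(W_m Q_m W_m^*) = \sum_{s,t=1}^{T} q_{mst}\,\langle w_{ms}, w_{mt} \rangle,$$

where $W_m = (w_{m1}, \ldots, w_{mT}) \in \mathcal{H}_m^T$ with adjoint $W_m^*$, and $\operatorname{tr}(\cdot)$ denotes the trace
class operator of the tensor Hilbert space $\mathcal{H}_m \otimes \mathcal{H}_m$. Note that also the direct sum
$\mathcal{H} := \bigoplus_{m=1}^{M} \mathcal{H}_m^T$ is a Hilbert space,
which will allow us to view $W \in \mathcal{H}$ as an element in a Hilbert space. The parameters
$\theta = (\theta_m)_{1 \le m \le M} \in \Theta_p$, $\Theta_p := \{\theta \in \mathbb{R}^M : \theta_m \ge 0,\ 1 \le m \le M,\ \|\theta\|_p \le 1\}$, are adaptive
weights of the views, where $\|\theta\|_p = \big(\sum_{m=1}^{M} |\theta_m|^p\big)^{1/p}$ denotes the $\ell_p$-norm. Here $\theta \succeq 0$ denotes
$\theta_m \ge 0$, $m = 1, \ldots, M$.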

Using the above specification of the regularizer and the loss term, we study the following 
unifying primal optimization problem. 

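Problem 1 (Primal problem). Solve

$$\inf_{\theta \in \Theta_p,\ W \in \mathcal{H}}\ \mathfrak{R}_\theta(W) + C\,\mathfrak{L}(A(W)),$$

where

$$\mathfrak{R}_\theta(W) = \frac{1}{2}\sum_{m=1}^{M} \frac{\|W_m\|^2_{Q_m}}{\theta_m}, \qquad \|W_m\|^2_{Q_m} := \operatorname{tr}(W_m Q_m W_m^*),$$

$$\mathfrak{L}(A(W)) := \sum_{i=1}^{n} l(A_i(W)), \qquad A(W) := (A_i(W))_{1 \le i \le n}, \qquad A_i(W) := y_i \sum_{m=1}^{M} \langle w_{m\tau(i)}, \varphi_m(x_i) \rangle.$$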
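To make Problem 1 concrete, the following minimal sketch evaluates the primal objective for the hinge loss under the assumption that every feature map $\varphi_m$ is available explicitly as a finite-dimensional vector. The function name, the array layout and the zero-based task indices are illustrative choices, not part of the original formulation.

```python
import numpy as np

def primal_objective(W, theta, Q, X, y, tau, C):
    """Hinge-loss instance of the primal objective for explicit feature maps.

    W     : array (M, T, d), W[m, t] is the weight vector w_mt
    theta : array (M,), positive weights of the views
    Q     : array (M, T, T), task-similarity matrices Q_m
    X     : array (M, n, d), X[m, i] is the feature vector phi_m(x_i)
    y     : array (n,), labels in {-1, +1}
    tau   : int array (n,), zero-based task index tau(i) of each example
    C     : float, trade-off constant
    """
    reg = 0.0
    scores = np.zeros(len(y))
    for m in range(W.shape[0]):
        gram = W[m] @ W[m].T                         # gram[s, t] = <w_ms, w_mt>
        reg += 0.5 * np.sum(Q[m] * gram) / theta[m]  # 0.5 * ||W_m||_Q^2 / theta_m
        scores += np.sum(W[m][tau] * X[m], axis=1)   # <w_{m tau(i)}, phi_m(x_i)>
    hinge = np.maximum(0.0, 1.0 - y * scores)        # l(A_i(W)) with A_i = y_i * score_i
    return reg + C * hinge.sum()
```

With all weights set to zero the regularizer vanishes and every example incurs unit hinge loss, so the objective equals $C\,n$, which is a quick sanity check for the array layout.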

2.2 Dualization 

Dual representations of optimization problems deliver insight into the problem, which can
be used in practice to, for example, develop optimization algorithms (as done in Section 3 of
this paper). In this section, we derive a dual representation of our unifying primal
optimization problem, i.e., Problem 1. Our dualization approach is based on Fenchel-Rockafellar
duality theory. The basic results of Fenchel-Rockafellar duality theory for Hilbert spaces are
reviewed in Appendix A. We present two dual optimization problems: one that is dualized
with respect to $W$ only (i.e., considering $\theta$ as being fixed) and one that completely removes
the dependency on $\theta$.

2.2.1 Computation of Conjugates and Adjoint Map 

To apply Fenchel's duality theorem, we need to compute the adjoint map $A^*$ of the linear
map $A : \mathcal{H} \to \mathbb{R}^n$, $A(W) = (A_i(W))_{1 \le i \le n}$, as well as the convex conjugates of $\mathfrak{R}_\theta$ and $\mathfrak{L}$.
See Appendix A for a review of the definitions of the convex conjugate and the adjoint map.
First, we notice that, by the basic identities for convex conjugates of Prop. 10 in Appendix
A and the separability of $\mathfrak{L}$ over the examples, we have that

$$(C\,\mathfrak{L})^*(\alpha) = C\,\mathfrak{L}^*(\alpha / C) = C \sum_{i=1}^{n} l^*(\alpha_i / C).$$

Next, we define $A^* : \mathbb{R}^n \to \mathcal{H}$ by
$A^*(\alpha) = \big(\sum_{i \in I_t} \alpha_i y_i \varphi_m(x_i)\big)_{1 \le m \le M,\ 1 \le t \le T}$. Recall that
the mapping between tasks and examples may be expressed in one of two ways. We may use the
index set $I_t$ to retrieve the indices of training examples associated with task $t$. Alternatively,
we may use the task indicator $\tau(i) \in \{1, \ldots, T\}$ to obtain the task index $\tau(i)$ associated with the
$i$-th training example. Using this notation, we verify that, for any $W \in \mathcal{H}$ and $\alpha \in \mathbb{R}^n$, it
holds that


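$$\begin{aligned}
\langle W, A^*(\alpha) \rangle
&= \sum_{m=1}^{M} \sum_{t=1}^{T} \sum_{i \in I_t} \alpha_i y_i \langle w_{mt}, \varphi_m(x_i) \rangle \\
&= \sum_{i=1}^{n} \sum_{m=1}^{M} \alpha_i y_i \langle w_{m\tau(i)}, \varphi_m(x_i) \rangle \\
&= \langle A(W), \alpha \rangle .
\end{aligned}$$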


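Thus, $A^*$ as defined above is indeed the adjoint map. Finally, we compute the conjugate of
$\mathfrak{R}_\theta$ with respect to $W$, where we consider $\theta$ as a constant (be reminded that the $Q_m$ are given).
We write $r_m(W_m) := \frac{1}{2}\|W_m\|^2_{Q_m}$ and note that, by Prop. 10,

$$\mathfrak{R}_\theta^*(W) = \Big(\sum_{m=1}^{M} \theta_m^{-1} r_m(W_m)\Big)^{*} = \sum_{m=1}^{M} \theta_m^{-1}\, r_m^*(\theta_m W_m).$$

Furthermore,

$$r_m^*(W_m) = \sup_{V_m}\ \underbrace{\langle V_m, W_m \rangle - \tfrac{1}{2}\operatorname{tr}(V_m Q_m V_m^*)}_{=:\ \psi(V_m)}. \qquad (1)$$

The supremum is attained when $\nabla_{V_m} \psi(V_m) = 0$, so that in the optimum $V_m = Q_m^{-1} W_m$.
Resubstitution into (1) gives $r_m^*(W_m) = \frac{1}{2}\operatorname{tr}(W_m Q_m^{-1} W_m^*) = \frac{1}{2}\|W_m\|^2_{Q_m^{-1}}$, so that we have

$$\mathfrak{R}_\theta^*(W) = \frac{1}{2}\sum_{m=1}^{M} \theta_m \|W_m\|^2_{Q_m^{-1}}.$$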
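The closed form of the conjugate can be checked numerically in a finite-dimensional setting. The sketch below assumes a single view, a symmetric positive definite $Q$, and a weight matrix whose columns are the task vectors $w_t$; in this column layout the maximizer $Q^{-1}W$ of the text is written as the matrix product `W @ inv(Q)`. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 3, 4
B = rng.standard_normal((T, T))
Q = B @ B.T + T * np.eye(T)          # symmetric positive definite task-similarity matrix
Q_inv = np.linalg.inv(Q)
W = rng.standard_normal((d, T))      # column t holds the vector w_t

def psi(V):
    """Objective inside the supremum of Eq. (1): <V, W> - 0.5 * tr(V Q V^T)."""
    return np.sum(V * W) - 0.5 * np.trace(V @ Q @ V.T)

V_star = W @ Q_inv                                   # stationary point of psi
closed_form = 0.5 * np.trace(W @ Q_inv @ W.T)        # 0.5 * ||W||^2 in the Q^{-1} norm
```

Since `psi` is strictly concave for positive definite `Q`, any perturbation of `V_star` lowers its value, and `psi(V_star)` agrees with `closed_form`.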


2.2.2 Dual Optimization Problems 


We may now apply Fenchel’s duality theorem (cf. Theorem 9 in Appendix A), which gives 
the following dual MTL problem: 

Problem 2 (Dual problem: partially dualized minimax formulation). Solve

$$\inf_{\theta \in \Theta_p}\ \sup_{\alpha \in \mathbb{R}^n}\ -\mathfrak{R}_\theta^*(A^*(\alpha)) - C\,\mathfrak{L}^*(-\alpha / C), \qquad (2)$$

where

$$\mathfrak{R}_\theta^*(A^*(\alpha)) = \frac{1}{2}\sum_{m=1}^{M} \theta_m \|A_m^*(\alpha)\|^2_{Q_m^{-1}}, \qquad \mathfrak{L}^*(\alpha) = \sum_{i=1}^{n} l^*(\alpha_i),$$

$$A^*(\alpha) := (A_m^*(\alpha))_{1 \le m \le M}, \qquad A_m^*(\alpha) = \Big(\sum_{i \in I_t} \alpha_i y_i \varphi_m(x_i)\Big)_{1 \le t \le T}. \qquad (3)$$

The above problem involves minimization with respect to (the primal variable) $\theta$ and
maximization with respect to (the dual variable) $\alpha$. The optimization algorithm presented
later in this paper is based on this minimax formulation. However, we may
completely remove the dependency on $\theta$, which sheds further light on the problem and
will later be exploited for optimization, i.e., to control the duality gap of the computed
solutions.

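To remove the dependency on $\theta$, we first note that Problem 2 is convex (even affine) in
$\theta$ and concave in $\alpha$ and thus, by Sion's minimax theorem, we may exchange the order of
minimization and maximization:

$$\begin{aligned}
\text{Eq. (2)}
&= \inf_{\theta \in \Theta_p}\ \sup_{\alpha \in \mathbb{R}^n}\ -\frac{1}{2}\sum_{m=1}^{M} \theta_m \|A_m^*(\alpha)\|^2_{Q_m^{-1}} - C\,\mathfrak{L}^*(-\alpha / C) \\
&= \sup_{\alpha \in \mathbb{R}^n}\ -\sup_{\theta \in \Theta_p} \frac{1}{2}\sum_{m=1}^{M} \theta_m \|A_m^*(\alpha)\|^2_{Q_m^{-1}} - C\,\mathfrak{L}^*(-\alpha / C) \\
&= \sup_{\alpha \in \mathbb{R}^n}\ -\frac{1}{2}\Big\| \big( \|A_m^*(\alpha)\|^2_{Q_m^{-1}} \big)_{1 \le m \le M} \Big\|_{p^*} - C\,\mathfrak{L}^*(-\alpha / C),
\end{aligned}$$

where the last step is by the definition of the dual norm, i.e.,
$\sup_{\theta \in \Theta_p} \langle \theta, v \rangle = \|v\|_{p^*}$ for a vector $v$ with nonnegative entries, and
$p^* := p / (p - 1)$ denotes the conjugated exponent. We thus have the following alternative
dual problem.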
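The dual-norm step can be illustrated numerically. The sketch below assumes $p > 1$ and a nonnegative, nonzero vector `v`; the maximizer $\theta_m \propto v_m^{p^* - 1}$ follows from the equality case of Hölder's inequality. The function names are hypothetical and serve illustration only.

```python
import numpy as np

def lp_norm(v, p):
    """l_p norm of a vector for any real p >= 1."""
    return float(np.sum(np.abs(v) ** p) ** (1.0 / p))

def best_theta(v, p):
    """Maximizer of <theta, v> over theta >= 0 with ||theta||_p <= 1.

    Assumes p > 1 and that v is nonnegative and not identically zero.
    """
    p_star = p / (p - 1.0)                 # conjugated exponent
    direction = v ** (p_star - 1.0)        # equality case of Hoelder's inequality
    return direction / lp_norm(direction, p)
```

For example, with `v = np.array([1.0, 2.0, 3.0])` and `p = 2`, the inner product `best_theta(v, 2.0) @ v` equals the Euclidean norm of `v`, since $p^* = 2$.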

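Problem 3 (Dual problem: completely dualized formulation). Solve

$$\sup_{\alpha \in \mathbb{R}^n}\ -\frac{1}{2}\Big\| \big( \|A_m^*(\alpha)\|^2_{Q_m^{-1}} \big)_{1 \le m \le M} \Big\|_{p^*} - C\,\mathfrak{L}^*(-\alpha / C),$$

where

$$\mathfrak{L}^*(\alpha) = \sum_{i=1}^{n} l^*(\alpha_i), \qquad A_m^*(\alpha) = \Big(\sum_{i \in I_t} \alpha_i y_i \varphi_m(x_i)\Big)_{1 \le t \le T}.$$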

2.3 Representer Theorem 

Fenchel's duality theorem (Theorem 9 in Appendix A) yields a useful optimality condition,
that is,

$$(W^*, \alpha^*)\ \text{optimal} \iff W^* = \nabla g^*(A^*(\alpha^*)),$$

under the minimal assumption that $g \circ A^*$ is differentiable in $\alpha^*$. The above requirement
can be thought of as an analog of the KKT stationarity condition in Lagrangian duality.
Note that we can rewrite the above equation by inserting the definitions of $g$ and $A$ from
the previous subsection; this gives

$$\forall m = 1, \ldots, M : \quad W_m^* = \theta_m Q_m^{-1} \Big(\sum_{i \in I_t} \alpha_i^* y_i \varphi_m(x_i)\Big)_{1 \le t \le T},$$

which we may rewrite as

$$\forall m = 1, \ldots, M,\ t = 1, \ldots, T : \quad w_{mt}^* = \theta_m \sum_{i=1}^{n} q^{(-1)}_{m t \tau(i)}\, \alpha_i^* y_i \varphi_m(x_i). \qquad (4)$$

The above equation gives us a representer theorem (Argyriou et al., 2009) for the optimal
$W^*$, which we will exploit later in this paper for deriving an efficient optimization algorithm
to solve Problem 1.
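For explicit feature vectors, Equation (4) translates directly into a matrix product. The following sketch assumes a single view $m$ with features stored row-wise and zero-based task indices; `recover_weights` is a hypothetical helper name, not a function of any toolbox.

```python
import numpy as np

def recover_weights(alpha, y, tau, X, Q_inv, theta_m):
    """Primal weights w_mt of one view from dual variables, following Eq. (4).

    alpha   : array (n,), dual variables
    y       : array (n,), labels in {-1, +1}
    tau     : int array (n,), zero-based task index of each example
    X       : array (n, d), row i is the feature vector phi_m(x_i)
    Q_inv   : array (T, T), inverse task-similarity matrix Q_m^{-1}
    theta_m : float, weight of the view
    Returns an array (T, d) whose row t is w_mt.
    """
    # coeff[t, i] = q^{(-1)}_{m, t, tau(i)} * alpha_i * y_i
    coeff = Q_inv[:, tau] * (alpha * y)
    return theta_m * (coeff @ X)
```

A prediction for a new point $x$ in task $t$ is then `recover_weights(...)[t] @ x`, which coincides with the kernel expansion $\theta_m \sum_i \alpha_i y_i\, q^{(-1)}_{m t \tau(i)} \langle \varphi_m(x_i), \varphi_m(x) \rangle$.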


2.4 Relation to Multiple Kernel Learning 


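Evgeniou et al. (2005) introduce the notion of a multi-task kernel. We can generalize this
framework by defining multiple multi-task kernels

$$\tilde{k}_m(x_i, x_j) := q^{(-1)}_{m \tau(i) \tau(j)}\, k_m(x_i, x_j), \qquad m = 1, \ldots, M. \qquad (5)$$

To see this, first note that the term $\|A_m^*(\alpha)\|^2_{Q_m^{-1}}$ can alternatively be written as

$$\begin{aligned}
\|A_m^*(\alpha)\|^2_{Q_m^{-1}}
&= \operatorname{tr}\big(A_m^*(\alpha)\, Q_m^{-1}\, A_m^*(\alpha)^*\big) \\
&= \sum_{s,t=1}^{T} q^{(-1)}_{mst} \Big\langle \sum_{i \in I_s} \alpha_i y_i \varphi_m(x_i),\ \sum_{j \in I_t} \alpha_j y_j \varphi_m(x_j) \Big\rangle \\
&= \sum_{s,t=1}^{T} q^{(-1)}_{mst} \sum_{i \in I_s} \sum_{j \in I_t} \alpha_i \alpha_j y_i y_j \underbrace{\langle \varphi_m(x_i), \varphi_m(x_j) \rangle}_{=\ k_m(x_i, x_j)} \\
&= \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \underbrace{q^{(-1)}_{m \tau(i) \tau(j)}\, k_m(x_i, x_j)}_{=\ \tilde{k}_m(x_i, x_j)},
\end{aligned}$$

so it follows that

$$\mathfrak{R}_\theta^*(A^*(\alpha)) = \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \theta_m \tilde{k}_m(x_i, x_j), \qquad (6)$$

and thus Problem 2 becomes

$$\inf_{\theta \in \Theta_p}\ \sup_{\alpha \in \mathbb{R}^n}\ -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \theta_m \tilde{k}_m(x_i, x_j) - C\,\mathfrak{L}^*(-\alpha / C), \qquad (7)$$

which is an $\ell_p$-regularized multiple-kernel-learning problem over the kernels $\tilde{k}_1, \ldots, \tilde{k}_M$
(Kloft et al., 2008b, 2011).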
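In matrix form, the multi-task kernel is the element-wise product of a base kernel matrix with the inverse task-similarity matrix expanded to example pairs. The following sketch assumes precomputed base kernel matrices and zero-based task indices; both function names are illustrative and not taken from any existing MKL implementation.

```python
import numpy as np

def multitask_kernels(K_list, Q_inv_list, tau):
    """Multi-task kernel matrices: entry (i, j) is q^{(-1)}_{m,tau(i),tau(j)} * k_m(x_i, x_j)."""
    return [Q_inv[np.ix_(tau, tau)] * K for K, Q_inv in zip(K_list, Q_inv_list)]

def dual_quadratic(alpha, y, theta, K_tilde_list):
    """Quadratic term of the dual: 0.5 * sum_ij a_i a_j y_i y_j sum_m theta_m k~_m(x_i, x_j)."""
    v = alpha * y
    K_theta = sum(t * K for t, K in zip(theta, K_tilde_list))
    return 0.5 * float(v @ K_theta @ v)
```

The resulting matrices can be passed to any MKL solver that accepts precomputed kernels, which is how the framework reduces to standard MKL.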


2.5 Specific Instantiations of the Framework 

In this section, we show that several regularization-based multi-task learning machines are 
subsumed by the generalized primal and dual formulations of Problems 1-2. As a first 
step, we will specialize our general framework to the hinge-loss, and show its primal and 
dual form. Based on this, we then instantiate our framework further to known methods 
in increasing complexity, starting with single-task learning (standard SVM) and working 
towards graph-regularized multitask learning and its relation to multitask kernels. Finally, 
we derive several novel methods from our general framework. 




                 loss $l(a)$, $a \in \mathbb{R}$      dual loss $l^*(a)$

hinge loss       $\max(0, 1 - a)$                     $a$ if $-1 \le a \le 0$; $\infty$ elsewise

logistic loss    $\log(1 + \exp(-a))$                 $-a \log(-a) + (1 + a) \log(1 + a)$ if $-1 \le a \le 0$; $\infty$ elsewise


Table 1: Examples of loss functions and corresponding conjugate functions. See
Appendix B.
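The entries of Table 1 can be sanity-checked by maximizing $a\,x - l(x)$ over a fine grid, which gives a lower bound on the conjugate $l^*(a) = \sup_x\, (a\,x - l(x))$. The sketch below is a numerical illustration under that assumption, with hypothetical function names; it is not a proof.

```python
import numpy as np

def hinge(x):
    return np.maximum(0.0, 1.0 - x)

def logistic(x):
    return np.logaddexp(0.0, -x)            # log(1 + exp(-x)), numerically stable

def hinge_conjugate(a):
    return a if -1.0 <= a <= 0.0 else np.inf

def logistic_conjugate(a):
    if not -1.0 <= a <= 0.0:
        return np.inf
    # v * log(v) is taken to be 0 at v = 0
    return float(sum(v * np.log(v) for v in (-a, 1.0 + a) if v > 0.0))

def conjugate_on_grid(loss, a, grid):
    """Lower bound on sup_x (a * x - loss(x)) obtained by maximizing over a finite grid."""
    return float(np.max(a * grid - loss(grid)))

grid = np.linspace(-50.0, 50.0, 100001)
```

Inside the interval $[-1, 0]$ the grid values agree with the closed forms up to discretization error; outside it, the grid maximum keeps growing as the grid is widened, reflecting the value $\infty$.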


2.5.1 Hinge Loss 

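Many existing multi-task learning machines utilize the hinge loss $l(a) = \max(0, 1 - a)$.
Employing the hinge loss in Problem 1 yields the loss term

$$\mathfrak{L}(A(W)) = \sum_{i=1}^{n} \max\Big(0,\ 1 - y_i \sum_{m=1}^{M} \langle w_{m\tau(i)}, \varphi_m(x_i) \rangle\Big).$$

Furthermore, as shown in Table 1, the conjugate of the hinge loss is $l^*(a) = a$ if $-1 \le a \le 0$,
and $\infty$ elsewise, which is readily verified by elementary calculus. Thus, we have

$$-C\,\mathfrak{L}^*(-\alpha / C) = -C \sum_{i=1}^{n} l^*(-\alpha_i / C) = \sum_{i=1}^{n} \alpha_i, \qquad (8)$$

provided that $\forall i = 1, \ldots, n : 0 \le \alpha_i \le C$; otherwise we have $-C\,\mathfrak{L}^*(-\alpha / C) = -\infty$. Hence,
for the hinge loss, we obtain the following pair of primal and dual problems.

Primal:

$$\inf_{\theta \in \Theta_p,\ W \in \mathcal{H}}\ \frac{1}{2}\sum_{m=1}^{M} \frac{\|W_m\|^2_{Q_m}}{\theta_m} + C \sum_{i=1}^{n} \max\Big(0,\ 1 - y_i \sum_{m=1}^{M} \langle w_{m\tau(i)}, \varphi_m(x_i) \rangle\Big) \qquad (9)$$

Dual:

$$\inf_{\theta \in \Theta_p}\ \sup_{0 \preceq \alpha \preceq C}\ -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \theta_m \tilde{k}_m(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \qquad (10)$$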

2.5.2 Single Task Learning 

Starting from the simplest special case, we briefly show how single-task learning methods
may be recovered from our general framework. By mapping well-understood single-task
methods onto our framework, we hope to achieve two things. First, we believe this will
greatly facilitate understanding for the reader who is familiar with standard methods like
the SVM. Second, we pave the way for applying efficient training algorithms developed in
Section 3 to these single-task formulations, for example yielding a new linear solver for
non-sparse Multiple Kernel Learning as a corollary.

Support Vector Machine  In the case of the single-task ($W = w$, $Q = 1$), single-kernel
SVM ($M = 1$), the primal from Equation (9) and the dual from Equation (10) can be greatly
simplified:

$$\inf_{w}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\big(0,\ 1 - y_i \langle w, \varphi(x_i) \rangle\big),$$

which corresponds to the well-established linear SVM formulation (without bias). Similarly,
the dual is readily obtained from Equation (10) and is given by

$$\sup_{0 \preceq \alpha \preceq C}\ -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) + \sum_{i=1}^{n} \alpha_i.$$

MKL  $\ell_p$-norm MKL (Kloft et al., 2011) is obtained as a special case of our framework.
This case is of particular interest, as it allows us to obtain a linear solver for $\ell_p$-norm MKL
as a corollary. By restricting the number of tasks to one (i.e., $T = 1$), $W_m$ becomes $w_m$
and $Q = 1$. Equation (9) reduces to:

$$\inf_{\theta \in \Theta_p,\ W}\ \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|^2}{\theta_m} + C \sum_{i=1}^{n} \max\Big(0,\ 1 - y_i \sum_{m=1}^{M} \langle w_m, \varphi_m(x_i) \rangle\Big).$$

In agreement with Kloft et al. (2009a), we recover the dual formulation from Equation (10):

$$\inf_{\theta \in \Theta_p}\ \sup_{0 \preceq \alpha \preceq C}\ -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \theta_m k_m(x_i, x_j) + \sum_{i=1}^{n} \alpha_i.$$

2.5.3 Multitask Learning 

Here, we first derive the primal and dual formulations of regularization-based multitask
learning as a special case of our framework and then give an overview of existing variants that
can be mapped onto this formulation as a precursor to novel instantiations in Section 2.6.
In this setting, we deal with multiple tasks $t$, but only a single kernel or task similarity
measure $Q$ (i.e., $M = 1$). The primal thus becomes:

$$\min_{W}\ \frac{1}{2}\|W\|^2_{Q} + C \sum_{i=1}^{n} \max\big(0,\ 1 - y_i \langle w_{\tau(i)}, \varphi(x_i) \rangle\big), \qquad (11)$$

with corresponding dual

$$\max_{0 \preceq \alpha \preceq C}\ -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, \tilde{k}(x_i, x_j) + \sum_{i=1}^{n} \alpha_i, \qquad (12)$$

where the definition of $\tilde{k}$ is given in Equation 5. As we will see in the following, the above
formulation captures several existing MTL approaches, which can be expressed by choosing
different encodings $Q$ for task similarity.

Frustratingly Easy Domain Adaptation  In a publication titled Frustratingly Easy Domain
Adaptation, Daumé (2007) presents a simple, yet appealing special case of graph-regularized
MTL. The setting comprises only two tasks (source task and target task) with a
fixed task relationship (i.e., the influence of the two tasks on each other is not determined
by their actual similarity). The idea is to assign a higher base-similarity to pairs of
examples from the same task than to pairs of examples from different tasks. This may be
expressed by the following multitask kernel:

$$\tilde{k}(x, z) = \begin{cases} 2\,k(x, z) & \text{if } t(x) = t(z), \\ k(x, z) & \text{else.} \end{cases}$$

From the above, we can readily read off the corresponding $Q^{-1}$ (and compute $Q$).
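As a worked instance, reading the kernel multipliers 2 (same task) and 1 (different tasks) off the multitask kernel above gives a $2 \times 2$ matrix $Q^{-1}$, which can be inverted numerically. The sketch below assumes exactly this reading, with the source task at index 0 and the target task at index 1.

```python
import numpy as np

# Inverse task-similarity matrix read off the kernel:
# factor 2 for pairs from the same task, factor 1 across tasks.
Q_inv = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
Q = np.linalg.inv(Q_inv)          # equals (1/3) * [[2, -1], [-1, 2]]

def regularizer(w_source, w_target):
    """0.5 * sum_{s,t} q_st <w_s, w_t> for the two-task case."""
    W = np.stack([w_source, w_target])      # rows are the task weight vectors
    return 0.5 * float(np.sum(Q * (W @ W.T)))
```

Expanding the quadratic form shows that `regularizer(w_s, w_t)` equals $\frac{1}{6}(\|w_s\|^2 + \|w_t\|^2) + \frac{1}{6}\|w_s - w_t\|^2$, i.e., a norm penalty plus a term coupling the two tasks.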



Given the above, we can express this special case in terms of Equations (11) and (12). With
some elementary algebra, this method can be viewed as pulling the weight vectors of source $w_s$
and target $w_t$ towards a common mean vector $\bar{w}$ by means of a regularization term. If we
generalize this idea to allow for multiple cluster centers, we arrive at task clustering, which
is described in the following.


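Task Clustering Regularization  Here, tasks are grouped into $M$ clusters, whereby
parameter vectors of tasks within each cluster are pulled towards the respective cluster
center $\bar{w}_m = \frac{1}{T_m}\sum_{t} w_t$ (summing over the tasks in cluster $m$), where $T_m$ is the
number of tasks in cluster $m$ (Evgeniou et al., 2005). To understand what $Q$ and $Q^{-1}$
correspond to in terms of Equations 11 and 12,
consider the definition of the multitask regularizer $\mathfrak{R}$ for task clustering:

$$\begin{aligned}
\mathfrak{R}(w_1, \ldots, w_T)
&= \frac{1}{2}\Big( \lambda \sum_{t=1}^{T} \|w_t\|^2 + \sum_{m=1}^{M} \Big( \rho\,\|\bar{w}_m\|^2 + \sum_{t=1}^{T} \rho_{mt}\,\|w_t - \bar{w}_m\|^2 \Big) \Big) \qquad & (13) \\
&= \frac{1}{2}\Big( \lambda \sum_{t=1}^{T} \|w_t\|^2 + \sum_{s,t=1}^{T} G_{s,t}\,\langle w_s, w_t \rangle \Big) & (14) \\
&= \frac{1}{2}\operatorname{tr}\big( W (\lambda I + G) W^\top \big), & (15)
\end{aligned}$$

where $M$ is the number of clusters, $\rho_{mt} \ge 0$ encodes the assignment of task $t$ to cluster $m$, $\rho$
controls the regularization of the cluster centers $\bar{w}_m$, and $G$ is given by

$$G_{s,t} = \sum_{m=1}^{M} \Big( \rho_{mt}\,\delta_{st} - \frac{\rho_{ms}\,\rho_{mt}}{\rho + \sum_{r=1}^{T} \rho_{mr}} \Big).$$

If every task $t$ is assigned to at least one cluster $m$ (i.e., $\forall t\ \exists m : \rho_{mt} > 0$), $G$ is positive definite
(Evgeniou et al., 2005) and we can express the above in terms of our primal formulation
in Equation 11 as $Q = (\lambda I + G)$ and the corresponding dual as $Q^{-1} = (\lambda I + G)^{-1}$, even
for $\lambda = 0$. We note that the formulation given in Section 2.5.3 may be expressed via task
clustering regularization by choosing only one cluster (i.e., $M = 1$) and setting $\lambda = 0$ and
$\rho = 1$, with $G$ equating to the task similarity matrix
$Q$ from the previous section.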
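The matrix $G$ is straightforward to assemble from the assignment weights. The sketch below assumes the weights are stored as an $M \times T$ array `rho_mt`, and that cluster centers are the minimizers of the clustering penalty, $\bar{w}_m = \sum_t \rho_{mt} w_t / (\rho + \sum_r \rho_{mr})$; this choice of centers is an assumption made here so that the penalty and the quadratic form in $G$ can be compared.

```python
import numpy as np

def clustering_matrix(rho_mt, rho):
    """G[s, t] = sum_m ( rho_mt * delta_st - rho_ms * rho_mt / (rho + sum_r rho_mr) )."""
    M, T = rho_mt.shape
    G = np.zeros((T, T))
    for m in range(M):
        r = rho_mt[m]
        G += np.diag(r) - np.outer(r, r) / (rho + r.sum())
    return G

def clustering_penalty(W, rho_mt, rho):
    """sum_m ( rho * ||c_m||^2 + sum_t rho_mt * ||w_t - c_m||^2 ) with minimizing centers c_m.

    W : array (T, d), row t is the weight vector w_t.
    """
    centers = (rho_mt @ W) / (rho + rho_mt.sum(axis=1))[:, None]
    total = 0.0
    for m in range(rho_mt.shape[0]):
        diffs = W - centers[m]
        total += rho * centers[m] @ centers[m] + rho_mt[m] @ np.sum(diffs ** 2, axis=1)
    return float(total)
```

For a single cluster containing two tasks with unit weights and $\rho = 1$, `clustering_matrix(np.ones((1, 2)), 1.0)` returns $\frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, which is the inverse of the matrix with entries 2 on the diagonal and 1 off the diagonal from the domain adaptation kernel.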


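Graph-regularized MTL  Graph-regularized MTL was established by Evgeniou et al.
(2005) and constitutes one of the most influential MTL approaches to date. Their method is
based on the following multi-task regularizer, which also forms one of the main inspirations
for our framework:

$$\begin{aligned}
\mathfrak{R}(w_1, \ldots, w_T)
&= \frac{1}{2}\sum_{t=1}^{T} \|w_t\|^2 + \frac{1}{4}\sum_{s,t=1}^{T} a_{st}\,\|w_s - w_t\|^2 \qquad & (16) \\
&= \frac{1}{2}\Big( \sum_{t=1}^{T} \|w_t\|^2 + \sum_{s,t=1}^{T} \ell_{st}\,\langle w_s, w_t \rangle \Big) & (17) \\
&= \frac{1}{2}\operatorname{tr}\big( W (I + L) W^\top \big), & (18)
\end{aligned}$$

where $A = (a_{st})_{1 \le s,t \le T} \in \mathbb{R}^{T \times T}$ is a given (symmetric) graph adjacency matrix encoding the pairwise
similarities of the tasks, $L = D - A = (\ell_{st})_{1 \le s,t \le T}$ denotes the corresponding graph Laplacian, where
$D_{s,t} := \delta_{s,t} \sum_{k} a_{s,k}$, and $I$ is a $T \times T$ identity matrix. Note that the number of zero
eigenvalues of the graph Laplacian corresponds to the number of connected components.
We may view graph-regularized MTL as an instantiation of our general primal problem,
Problem 1, where we have only one task similarity measure $Q_1 = I + L$ (i.e., $M = 1$). As
the graph Laplacian $L$ is not invertible in general, we use its pseudo-inverse $L^\dagger$ to express
the dual formulation of the above MTL regularizer:

$$Q^{-1}_{s,t} = L^\dagger_{s,t} = \sum_{i=1}^{r} \sigma_i^{-1}\, v_{si}\, v_{ti}, \qquad (19)$$

where $r$ is the rank of $L$, $\sigma_i$ are the (nonzero) eigenvalues of $L$ and $V = (v_{st})$ is the orthogonal matrix
of eigenvectors.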
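The pseudo-inverse in Equation (19) can be computed from an eigendecomposition of the Laplacian by inverting only the nonzero eigenvalues. The sketch below assumes a symmetric adjacency matrix and uses a small tolerance to decide which eigenvalues count as zero; the function name and tolerance are illustrative.

```python
import numpy as np

def laplacian_pinv(A, tol=1e-10):
    """Graph Laplacian L = D - A and its pseudo-inverse via eigendecomposition.

    A   : array (T, T), symmetric adjacency matrix of the task graph
    tol : eigenvalues below this threshold are treated as zero
    """
    L = np.diag(A.sum(axis=1)) - A
    evals, V = np.linalg.eigh(L)              # L is symmetric, so eigh applies
    keep = evals > tol                        # the r nonzero eigenvalues
    L_pinv = (V[:, keep] / evals[keep]) @ V[:, keep].T
    return L, L_pinv
```

For a graph consisting of two disconnected edges, the Laplacian has exactly two zero eigenvalues, one per connected component, and the result agrees with `np.linalg.pinv(L)`.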


Multi-task Kernels  In contrast to graph-regularized MTL, where task relations are
captured by an adjacency matrix or graph Laplacian as discussed in the previous paragraph,
task relationships may directly be expressed in terms of a kernel on tasks $K_{\text{tasks}}$. This
relationship has been illuminated in Section 2.4, where we have seen that the kernel on tasks
corresponds to $Q^{-1}$ in our dual MTL formulation. A formulation involving a combination
of several MTL kernels with a fixed weighting was explored by Jacob and Vert (2008) in the
context of Bioinformatics. In its most basic form, the authors considered a multitask kernel
of the form

$$K((x, t), (z, s)) = K_{\text{base}}(x, z) \cdot K_{\text{tasks}}(t, s).$$

Furthermore, the authors considered a sum of different multi-task kernels, among them the
corner cases $K_{\text{Dirac}}(t, s) = \delta_{s,t}$ (independent tasks) and the uniform kernel $K_{\text{Uni}}(t, s) = 1$
(uniformly related tasks). In general, their dual formulation is given by

$$K((x, t), (z, s)) = K_{\text{base}}(x, z) \cdot \sum_{m=1}^{M} K^{(m)}_{\text{tasks}}(t, s).$$

The above is a very interesting special case and can easily be expressed within our
general framework. For this, consider the dual formulation given in Equation (10) for
$Q^{(m)\,-1} = K^{(m)}_{\text{tasks}}$ and $\theta_1 = \ldots = \theta_M = 1$. In other words, the above also constitutes a form
of multitask multiple kernel learning, however, without actually learning the kernel weights
$\theta_m$. Nevertheless, the choice and discussion of different multitask kernels $K^{(m)}_{\text{tasks}}$ in Jacob
and Vert (2008) is of high relevance with respect to the family of methods explored in this
work.


2.6 Proposing Novel Instances of Multi-task Learning Machines


We now move ahead and derive novel instantiations from our general framework. Most 
importantly, we go beyond previous formulations by learning or refining task similarities 
from data using MKL as an engine. 



Figure 1: Learning additive transformations of task similarities: (a) Multigraph MT-MKL,
where one combines similarities from multiple independent graphs (which includes the
approaches proposed in Widmer et al. (2010c); Jacob and Vert (2008)); (b) Hierarchical
MT-MKL, where one uses a tree to generate specific similarity matrices (as proposed in Widmer
et al. (2010a,c); Görnitz et al. (2011); Widmer et al. (2012)); and (c) Smooth MT-MKL,
where one uses multiple transformations of an existing similarity matrix for linear
combination.

2.6.1 Multi-graph MT-MKL


One of the most popular MTL approaches is graph-regularized MTL by Evgeniou and Pontil
(2004). We have seen in Section 2.5.3 that such a graph is expressed as an adjacency matrix $A$
and may alternatively be expressed in terms of its graph Laplacian $L$. Our extension readily
deals with multiple graphs encoding task similarity, $A_m = (a_{mst})_{1 \le s,t \le T} \in \mathbb{R}^{T \times T}$, which is
of interest in cases where, as in multiple kernel learning, we have access to alternative
sources of task similarity and it is unclear which one is best suited. This concept gives rise
to the multi-graph MTL regularizer

R{W) = 2 tr {yZ=i ’ 

where $L_m$ denotes the graph Laplacian corresponding to $A_m$. As before, we learn a weighting
of the given graphs, thereby determining which measures are best suited to maximize
prediction accuracy.


2.6.2 Hierarchical MT-MKL 


Recall that in task clustering, parameter vectors of tasks within the same cluster are
coupled (Equation 13). The strength of that coupling, however, has to be chosen in advance
and remains fixed throughout the learning procedure. We extend the formulation of task
clustering by introducing a weighting $\theta_m$ for task cluster $m$ and tuning this weighting using
our framework. We decompose $G$ over clusters and arrive at the following MTL regularizer


R{wi,.. .,wt) 


2 (E„., ii“-”ii + E„... E.,,., 


( 20 ) 

( 21 ) 


where $G^m$ is given by

$$G^m_{s,t} = \rho_{mt}\,\delta_{st} - \frac{\rho_{ms}\,\rho_{mt}}{\rho + \sum_{r=1}^{T} \rho_{mr}}.$$


Note that, if not all tasks belong to the same cluster, $G^m$ will not be invertible. Therefore,
we need to express the mapping onto the dual of our general framework from Equation (10)
in terms of the pseudo-inverse (see Equation 19) of $G^m$: $Q_m^{-1} = (G^m)^\dagger$.

An important special case of the above is given by a scenario where task relationships
are described by a hierarchical structure $\mathcal{G}$ (see Figure 1(b)), such as a tree or a directed
acyclic graph. Assuming hierarchical relations between tasks is particularly relevant to
Computational Biology, where different tasks often correspond to different organisms. In
this context, we expect that the longer the common evolutionary history between two
organisms, the more beneficial it is to share information between these organisms in an
MTL setting. The tasks correspond to the leaves or terminal nodes, and each inner node
$n_m$ defines a cluster $m$ by grouping the tasks of all terminal nodes that are descendants of
the current node $n_m$. As before, the task clusters can be used in the way discussed in the
previous section.

2.6.3 Smooth hierarchical MT-MKL


Finally, we present a variant that may be regarded as a smooth version of the hierarchical
MT-MKL approach presented above. Here, however, we require access to a given task
similarity matrix $A$, which is then transformed by squared exponentials with
different length scales $\sigma_m$, for instance, via $\exp(A_{st} / \sigma_m)$. We use MT-MKL to learn
a weighting of the kernels associated with the different length scales, which corresponds
to finding the right level in the hierarchy to trade off information between tasks. As an
example, consider Figure 1(c), where we show the original task similarity matrix and the
transformed matrices at different length scales.
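The family of transformed similarity matrices can be generated in one line per length scale. The sketch below assumes the element-wise transformation $\exp(A_{st} / \sigma_m)$ mentioned in the text and a hypothetical list of length scales; whether the resulting matrices are valid (positive semidefinite) task kernels depends on $A$ and is not checked here.

```python
import numpy as np

def smooth_similarities(A, sigmas):
    """One transformed task-similarity matrix exp(A / sigma) per length scale sigma."""
    A = np.asarray(A, dtype=float)
    return [np.exp(A / sigma) for sigma in sigmas]
```

Small length scales exaggerate differences between entries of $A$, while very large length scales flatten all entries towards 1, i.e., towards uniformly related tasks.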


3 Algorithms 

In this section, we present efficient optimization algorithms to solve the primal and dual
problems, i.e., Problems 1 and 2, respectively. We distinguish the cases of linear and
non-linear kernel matrices. For non-linear kernels, we can simply use existing MKL
implementations, while, for linear kernels, we develop a specifically tailored large-scale algorithm
that allows us to train on problems with a large number of data points and dimensions,
as demonstrated on several data sets. We can even employ this algorithm for non-linear
kernels, if the kernel admits a sparse, efficiently computable feature representation. For
example, this is the case for certain string kernels and polynomial kernels of degree 2 or 3.
Our algorithms are embedded into the COFFIN framework (Sonnenburg and Franc, 2010)
and integrated into the SHOGUN large-scale machine learning toolbox (Sonnenburg et al.,
2010).


3.1 General Algorithms for Non-linear Kernels 

A very convenient way to numerically solve the proposed framework is to simply exploit
existing MKL implementations. To see this, recall from Section 2.4 that if we use the
multi-task kernels $k_1, \ldots, k_M$ as defined in (5) as the set of multiple kernels, the completely
dualized MKL formulation (see Problem 3) is given by


inf sup 

esBp cteK":EILi 


n M 

ij=l m=l 


l<m<M 


C£*{-a/C). 


An efficient optimization approach is that of Vishwanathan et al. (2010), who optimize the
completely dualized MKL formulation. This implementation does not require a $\theta$-step,
but the computations in the $\alpha$-steps are more costly than in the case of vanilla
(MT-)SVMs.

Further, combining the partially dualized formulation in Problem 2 with the definition 
of multi-task kernels from (5), we arrive at an equivalent problem to (7), that is, 


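$$\inf_{\theta \in \Theta_p} \; \sup_{\alpha \in \mathbb{R}^n} \; -\frac{1}{2} \sum_{i,j=1}^{n} \sum_{m=1}^{M} \theta_m \alpha_i \alpha_j \, k_m(x_i, x_j) \; - \; C \mathcal{L}^*(-\alpha/C),$$
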
which is exactly the optimization problem of $\ell_p$-norm multiple kernel learning as described
in Kloft et al. (2011). We may thus build on existing research in the field of MKL and
use one of the prevalent efficient implementations to solve $\ell_p$-norm MKL. Most of the
$\ell_p$-norm MKL solvers are specifically tailored to the hinge loss. A proven implementation is,
for example, the interleaved optimization method of Kloft et al. (2011), which is directly
integrated into the SVMLight module (Joachims, 1999) of the SHOGUN toolbox such that
the $\theta$-step is performed after each decomposition step, i.e., after solving the small QP
occurring in SVMLight, which allows very fast convergence (Sonnenburg et al., 2006).

For an overview of MKL algorithms and their implementations, see the survey paper by
Gönen and Alpaydın (2011).
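
To illustrate why an off-the-shelf MKL or SVM solver suffices in this setting, the following sketch assembles the weighted sum of multi-task kernels into a single kernel matrix that such a solver could consume. It assumes linear base kernels and that each multi-task kernel is the product of a task-coupling entry and the base kernel; the coupling matrices, data, and function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def combined_kernel(X: List[Vector], tasks: List[int],
                    couplings: List[Matrix], theta: List[float]) -> Matrix:
    """K[i][j] = sum_m theta_m * B_m[task(i)][task(j)] * <x_i, x_j> (assumed form)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            weight = sum(th * B[tasks[i]][tasks[j]] for th, B in zip(theta, couplings))
            K[i][j] = weight * dot(X[i], X[j])
    return K


# Hypothetical toy data: three points, two tasks, two coupling matrices.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tasks = [0, 0, 1]
couplings = [
    [[1.0, 0.0], [0.0, 1.0]],  # tasks treated independently
    [[1.0, 1.0], [1.0, 1.0]],  # all tasks pooled
]
K = combined_kernel(X, tasks, couplings, theta=[0.5, 0.5])
```

With equal weights, points from different tasks are coupled at half strength, which interpolates between training the tasks individually and pooling all data.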

3.2 A Large-scale Algorithm for Linear or String Kernels and Beyond 

For specific kernels such as linear kernels and string kernels (and, more generally, any kernel
admitting an efficient feature space representation), we can derive a specifically tailored
large-scale algorithm. This requires considerably more work than the algorithm presented
in the previous subsection.

3.2.1 Overview 

From a top-level view, the algorithm presented below is based on the core idea of alternating the
following two steps:

1. the $\theta$ step, where the kernel weights are improved

2. the $W$ step, where the remaining primal variables are improved.


Algorithm 1 (Blueprint of the large-scale optimization algorithm). The MKL
module ($\theta$ step) is wrapped around the MTL module ($W$ step).

1: input: data $x_1, \ldots, x_n \in \mathcal{X}$ and labels $y_1, \ldots, y_n \in \{-1, 1\}$ associated with tasks
$\tau(1), \ldots, \tau(n) \in \{1, \ldots, T\}$; feature vectors $\phi_1(x_i), \ldots, \phi_M(x_i)$; task similarity matrices
$Q_1, \ldots, Q_M$; optimization precision $\varepsilon$
2: initialize $\theta_m := \sqrt[p]{1/M}$ for all $m = 1, \ldots, M$, initialize $W = 0$
3: while optimality conditions are not satisfied within tolerance $\varepsilon$ do
4: $W$ descent step: compute new $W$ such that the objective $\mathfrak{R}_\theta(W) + C \mathcal{L}(W)$ decreases

5: $W := \operatorname{argmin}_W \; \mathfrak{R}_\theta(W) + C \mathcal{L}(W)$

6: $\theta$ step: compute minimizer $\theta := \operatorname{argmin}_{\theta \in \Theta_p} \; \mathfrak{R}_\theta(W) + C \mathcal{L}(W)$ according to (22)

7: end while

8: output: $\varepsilon$-accurate optimal hypothesis $W$ and kernel weights $\theta$


These steps are illustrated in Algorithm Table 1. We observe from the table that the
variables are split into the two sets $\{\theta_m \mid m = 1, \ldots, M\}$ and $\{w_{mt} \mid m = 1, \ldots, M,\ t = 1, \ldots, T\}$.
The algorithm then alternatingly optimizes with respect to one or the other set until the
optimality conditions are approximately satisfied. We will analyze convergence of this
optimization scheme later in this section. Note that similar algorithms have been used in the
context of the group lasso and multiple kernel learning by, for instance, Roth and Fischer
(2008), Xu et al. (2010), and Kloft et al. (2011).


3.2.2 Solving the $\theta$ Step


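In this section, we discuss how to compute the update of the kernel weights $\theta$ as carried
out in Line 6 of Algorithm 1. Note that for fixed $W \in \mathcal{H}$ it holds

$$\operatorname*{arg\,inf}_{\theta \in \Theta_p} \; \mathfrak{R}_\theta(W) + C \mathcal{L}(W) \;=\; \operatorname*{arg\,inf}_{\theta \in \Theta_p} \; \mathfrak{R}_\theta(W),$$

where $\mathfrak{R}_\theta(W) = \frac{1}{2} \sum_{m=1}^{M} \frac{\operatorname{tr}(W_m Q_m W_m^\top)}{\theta_m}$. Furthermore, by Lagrangian duality,

$$\inf_{\theta \in \Theta_p} \frac{1}{2} \sum_{m=1}^{M} \frac{\operatorname{tr}(W_m Q_m W_m^\top)}{\theta_m}
= \max_{\lambda \ge 0} \; \inf_{\theta \ge 0} \; \frac{1}{2} \sum_{m=1}^{M} \frac{\operatorname{tr}(W_m Q_m W_m^\top)}{\theta_m} + \lambda \Big( \sum_{m=1}^{M} \theta_m^p - 1 \Big)$$

$$= \inf_{\theta \ge 0} \; \frac{1}{2} \sum_{m=1}^{M} \frac{\operatorname{tr}(W_m Q_m W_m^\top)}{\theta_m} + \lambda^* \Big( \sum_{m=1}^{M} \theta_m^p - 1 \Big),$$

where we denote the optimal $\lambda$ in the above maximization by $\lambda^*$. The infimum is either
attained at the boundary of the constraints or when the gradient with respect to $\theta$ vanishes; thus the optimal point
$\theta^*$ satisfies $\theta_m^* = \big( \operatorname{tr}(W_m Q_m W_m^\top) / (2 p \lambda^*) \big)^{1/(p+1)}$ for any $m = 1, \ldots, M$. Because $\theta^* \in \Theta_p$
with $\|\theta^*\|_p = 1$, it follows that $\lambda^* = \frac{1}{2p} \big( \sum_{m=1}^{M} \operatorname{tr}(W_m Q_m W_m^\top)^{p/(p+1)} \big)^{(p+1)/p}$, under the minimal
assumption that $W \ne 0$. Thus, because $\operatorname{tr}(W_m Q_m W_m^\top) = \sum_{s,t=1}^{T} q_{mst} \langle w_{ms}, w_{mt} \rangle$,

$$\forall m = 1, \ldots, M: \quad \theta_m = \frac{\Big( \sum_{s,t=1}^{T} q_{mst} \langle w_{ms}, w_{mt} \rangle \Big)^{1/(p+1)}}{\Big( \sum_{m'=1}^{M} \Big( \sum_{s,t=1}^{T} q_{m'st} \langle w_{m's}, w_{m't} \rangle \Big)^{p/(p+1)} \Big)^{1/p}}. \tag{22}$$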
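
The closed-form update (22) only needs the M scalars tr(W_m Q_m W_m^T). The following minimal sketch evaluates it for given trace values and checks the result against the uniform initialization; the function names and the example numbers are assumptions for illustration.

```python
from typing import List


def theta_step(traces: List[float], p: float) -> List[float]:
    """Update (22); traces[m] = tr(W_m Q_m W_m^T) >= 0, not all zero, and p >= 1."""
    denom = sum(a ** (p / (p + 1.0)) for a in traces) ** (1.0 / p)
    return [a ** (1.0 / (p + 1.0)) / denom for a in traces]


def regularizer(traces: List[float], theta: List[float]) -> float:
    """Value of 1/2 * sum_m traces[m] / theta[m] for strictly positive theta."""
    return 0.5 * sum(a / t for a, t in zip(traces, theta))


# Hypothetical trace values for M = 3 kernels.
traces = [4.0, 1.0, 0.25]
p = 2.0
theta = theta_step(traces, p)
uniform = [(1.0 / len(traces)) ** (1.0 / p)] * len(traces)  # initialization of Algorithm 1
```

The returned weights lie on the unit $\ell_p$ sphere and yield a regularizer value no larger than that of the uniform initialization, as expected from a minimizer over $\Theta_p$.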


3.2.3 Solving the W Descent Step 

To solve the $W$ step as carried out in Line 4 of Algorithm 1, we consider the kernel weights
$\{\theta_m \mid m = 1, \ldots, M\}$ as being fixed and optimize solely with respect to $W$. In fact, we
perform the $W$ descent step in the dual, i.e., by optimizing the dual objective of Problem 2,
that is, by solving

$$\sup_{\alpha \in \mathbb{R}^n} \; -\mathfrak{R}_\theta^*(A^*(\alpha)) - C \mathcal{L}^*(-\alpha/C).$$

Although our framework is also valid for other loss functions, for the presentation of the
algorithm, we make a specific choice of a proven loss function, that is, the hinge loss $l(a) =
\max(0, 1 - a)$, so that by (8), the above task becomes

$$\sup_{\alpha \in \mathbb{R}^n :\, 0 \le \alpha \le C} \; -\mathfrak{R}_\theta^*(A^*(\alpha)) + \sum_{i=1}^{n} \alpha_i. \tag{23}$$

Our algorithm optimizes (23) by dual coordinate ascent, i.e., by optimizing the dual
variables $\alpha_i$ one after another (i.e., only a single dual variable $\alpha_i$ is optimized at a time),

$$\sup_{d \in \mathbb{R} :\, 0 \le \alpha_i + d \le C} \; -\mathfrak{R}_\theta^*(A^*(\alpha + d e_i)) + \sum_{j=1}^{n} \alpha_j + d,$$

where we denote the unit vector of the $i$th coordinate in $\mathbb{R}^n$ by $e_i$. As we will see, this task can
be performed analytically; however, when performed purely in the dual, it involves computing a sum
over all support vectors, which is infeasible for large $n$. Our proposed algorithm is, instead,
based on the application of the representer theorem carried out in Section 2.3: recall from
(4) that, for all $m = 1, \ldots, M$ and $t = 1, \ldots, T$, it holds

$$w_{mt} = \theta_m \sum_{i=1}^{n} \alpha_i y_i \, (Q_m^{-1})_{\tau(i) t} \, \phi_m(x_i).$$

The core idea is to express the update of the $\alpha_i$ in the coordinate ascent procedure solely
in terms of the vectors $w_{mt}$. While optimizing the variables $\alpha_i$ one after another, we keep
track of the changes in the vectors $w_{mt}$. This procedure is reminiscent of the dual coordinate
ascent method, but differs in the way the objective is computed. Of course, this implies that
we need to manipulate feature vectors, which explains why our approach relies on an efficient
infrastructure for storing and computing feature vectors and their inner products. If the
infrastructure is adequate, so that computing inner products in the feature space is more
efficient than computing a row of the kernel matrix, our algorithm yields a substantial
gain in efficiency.


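Expressing the update of a single variable $\alpha_i$ in terms of the vectors $w_{mt}$. As
argued above, our aim is to express the (analytical) computation of

$$\sup_{d \in \mathbb{R} :\, 0 \le \alpha_i + d \le C} \; -\mathfrak{R}_\theta^*(A^*(\alpha + d e_i)) + \sum_{j=1}^{n} \alpha_j + d$$

solely in terms of the vectors $w_{mt}$. To start the derivation, note that, by (3),

$$\mathfrak{R}_\theta^*(A^*(\alpha + d e_i)) = \frac{1}{2} \sum_{m=1}^{M} \theta_m \, \|A^*(\alpha + d e_i)\|^2_{Q_m^{-1}},$$

with, by (6),

$$\|A^*(\alpha + d e_i)\|^2_{Q_m^{-1}} = \sum_{j,j'=1}^{n} \alpha_j \alpha_{j'} y_j y_{j'} k_m(x_j, x_{j'}) + 2 d y_i \sum_{j=1}^{n} \alpha_j y_j k_m(x_i, x_j) + d^2 k_m(x_i, x_i),$$

where

$$k_m(x_i, x_j) = (Q_m^{-1})_{\tau(i)\tau(j)} \, \langle \phi_m(x_i), \phi_m(x_j) \rangle$$

is the $m$th multi-task kernel as defined in (5). Thus,

$$\operatorname*{arg\,sup}_{d \in \mathbb{R} :\, 0 \le \alpha_i + d \le C} \; -\mathfrak{R}_\theta^*(A^*(\alpha + d e_i)) + \sum_{j=1}^{n} \alpha_j + d$$

$$= \operatorname*{arg\,sup}_{d \in \mathbb{R} :\, 0 \le \alpha_i + d \le C} \; d - d y_i \sum_{j=1}^{n} \alpha_j y_j \Big( \sum_{m=1}^{M} \theta_m k_m(x_i, x_j) \Big) - \frac{1}{2} d^2 \Big( \sum_{m=1}^{M} \theta_m k_m(x_i, x_i) \Big)$$

$$= \operatorname*{arg\,sup}_{d \in \mathbb{R} :\, 0 \le \alpha_i + d \le C} \; \underbrace{d - d y_i \Big( \sum_{m=1}^{M} \langle w_{m\tau(i)}, \phi_m(x_i) \rangle \Big) - \frac{1}{2} d^2 \Big( \sum_{m=1}^{M} \theta_m k_m(x_i, x_i) \Big)}_{=:\, \psi(d)}.$$
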
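The optimum of $\psi(d)$ is either attained at the boundaries of the constraint $0 \le \alpha_i + d \le C$
or when $\psi'(d) = 0$. Hence, the optimal $d^*$ can be expressed analytically as

$$d^* = \max\Big( -\alpha_i, \; \min\Big( C - \alpha_i, \; \frac{1 - y_i \sum_{m=1}^{M} \langle w_{m\tau(i)}, \phi_m(x_i) \rangle}{\sum_{m=1}^{M} \theta_m k_m(x_i, x_i)} \Big) \Big). \tag{24}$$

Whenever we update an $\alpha_i$ according to

$$\alpha_i^{\text{new}} = \alpha_i^{\text{old}} + d,$$

with $d$ computed as in (24), we need to also update the vectors $w_{mt}$, $m = 1, \ldots, M$,
$t = 1, \ldots, T$, according to

$$w_{mt}^{\text{new}} := w_{mt}^{\text{old}} + d \, \theta_m y_i \, (Q_m^{-1})_{\tau(i) t} \, \phi_m(x_i) \tag{25}$$

to be consistent with (4). Similarly, we need to update the vectors $w_{mt}$ after each $\theta$ step
according to

$$w_{mt}^{\text{new}} := \big( \theta_m^{\text{new}} / \theta_m^{\text{old}} \big) \, w_{mt}^{\text{old}}. \tag{26}$$
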

To avoid recurrences in the iterates, a $\theta$-step should only be performed if the primal objective
has decreased between subsequent $\theta$-steps. Thus, after each $\alpha$ epoch, the primal objective
needs to be computed in terms of $W$. As described above, the algorithm keeps $W$ up to
date when $\alpha$ changes, which makes this task particularly simple.
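
The clipped update of a single dual variable needs only two scalars: the current margin of example i and the curvature term in the denominator. The following sketch, with hypothetical argument names, shows the three possible regimes (interior step, clipping at 0, clipping at C).

```python
def coordinate_step(alpha_i: float, margin: float, curvature: float, C: float) -> float:
    """Clipped step for one dual variable.

    margin    = y_i * sum_m <w_{m,tau(i)}, phi_m(x_i)>
    curvature = sum_m theta_m * k_m(x_i, x_i), assumed strictly positive
    """
    unconstrained = (1.0 - margin) / curvature
    return max(-alpha_i, min(C - alpha_i, unconstrained))


interior = coordinate_step(alpha_i=0.0, margin=0.0, curvature=2.0, C=1.0)
at_lower = coordinate_step(alpha_i=0.0, margin=2.0, curvature=1.0, C=1.0)
at_upper = coordinate_step(alpha_i=0.9, margin=-3.0, curvature=1.0, C=1.0)
```

An example that already satisfies its margin with a zero dual variable is left untouched, while a badly violated example is pushed up to the box constraint.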

The resulting large-scale algorithm is summarized in Algorithm Table 2. The data and the
labels are input to the algorithm, as well as a sub-procedure for efficient computation of
feature maps (cf. Section 3.2.4). Lines 2 and 3 initialize the optimization variables. In
Line 4 the inverses of the task similarity matrices are pre-computed. Algorithm 2 iterates
over Lines 7-16 until the stopping criterion falls under a pre-defined accuracy threshold $\varepsilon$.
In Lines 7-11 the line search is computed for all dual variables. Lines 14 and 15 update
the kernel weights and the primal variables to be consistent with the representer theorem, but only
if the primal objective has decreased since the last $\theta$-step. We stop Algorithm 2 when the
relative change in the objective $o$ is less than $\varepsilon$. Notice that we do not optimize the $W$ step
to full precision, but instead alternate between one pass over the $\alpha_i$ and a $\theta$ step.
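
The following toy sketch puts the pieces together in plain Python. It is not the SHOGUN implementation: it assumes that every kernel uses the identity feature map, that the inverse task similarity matrices are supplied by the caller, and that a fixed number of epochs replaces the stopping criterion; all names and the toy data are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def train(X: List[Vector], y: List[int], tasks: List[int],
          Q: List[Matrix], Q_inv: List[Matrix],
          C: float = 1.0, p: float = 2.0, epochs: int = 30
          ) -> Tuple[List[float], List[float], List[List[Vector]]]:
    """Toy dual coordinate ascent with interleaved kernel-weight updates."""
    n, M, T, dim = len(X), len(Q), len(Q[0]), len(X[0])
    alpha = [0.0] * n
    theta = [(1.0 / M) ** (1.0 / p)] * M
    w = [[[0.0] * dim for _ in range(T)] for _ in range(M)]  # w[m][t]

    def trace(m: int) -> float:  # tr(W_m Q_m W_m^T)
        return sum(Q[m][s][t] * dot(w[m][s], w[m][t])
                   for s in range(T) for t in range(T))

    def score(i: int) -> float:  # sum_m <w_{m,tau(i)}, x_i>
        return sum(dot(w[m][tasks[i]], X[i]) for m in range(M))

    def primal() -> float:
        reg = 0.5 * sum(trace(m) / theta[m] for m in range(M))
        return reg + C * sum(max(0.0, 1.0 - y[i] * score(i)) for i in range(n))

    objective = n * C  # primal value at W = 0
    for _ in range(epochs):
        for i in range(n):
            s = tasks[i]
            curvature = dot(X[i], X[i]) * sum(
                theta[m] * Q_inv[m][s][s] for m in range(M))
            if curvature <= 0.0:
                continue
            d = (1.0 - y[i] * score(i)) / curvature
            d = max(-alpha[i], min(C - alpha[i], d))   # clipped step, cf. (24)
            alpha[i] += d
            for m in range(M):                         # keep w consistent, cf. (25)
                for t in range(T):
                    c = d * y[i] * theta[m] * Q_inv[m][s][t]
                    w[m][t] = [wk + c * xk for wk, xk in zip(w[m][t], X[i])]
        previous, objective = objective, primal()
        if objective < previous:                       # theta step only after a decrease
            a = [max(trace(m), 1e-12) for m in range(M)]  # numerical floor (assumption)
            norm = sum(v ** (p / (p + 1.0)) for v in a) ** (1.0 / p)
            new_theta = [v ** (1.0 / (p + 1.0)) / norm for v in a]  # cf. (22)
            for m in range(M):                         # rescale w, cf. (26)
                w[m] = [[wk * new_theta[m] / theta[m] for wk in w[m][t]]
                        for t in range(T)]
            theta = new_theta
    return alpha, theta, w


# Hypothetical toy problem: two tasks, two task similarity matrices.
X = [[2.0, 0.0], [-2.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1, -1, 1, -1]
tasks = [0, 0, 1, 1]
Q = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, -1.0], [-1.0, 2.0]]]
Q_inv = [[[1.0, 0.0], [0.0, 1.0]], [[2 / 3, 1 / 3], [1 / 3, 2 / 3]]]
alpha, theta, w = train(X, y, tasks, Q, Q_inv)
margins = [y[i] * sum(dot(w[m][tasks[i]], X[i]) for m in range(2)) for i in range(4)]
```

Because the vectors in `w` are updated together with every change of `alpha` and `theta`, neither the kernel matrix nor a sum over support vectors is ever formed, which is the point of the construction.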


3.2.4 Details on the Implementation 

We have implemented the optimization algorithms described in the previous section into
the general framework of the SHOGUN machine learning toolbox (Sonnenburg et al., 2010).
Besides the described implementations for binary classification, we also provide
implementations for novelty detection and regression. Furthermore, the user may choose an
optimization scheme, that is, decide whether one of the classic, non-linear MKL solvers shall
be used (either the analytic optimization algorithm of Kloft et al. (2011), the cutting plane
method of Sonnenburg et al. (2006), or the Newton algorithm by Kloft et al. (2009a)), or
the novel implementation for efficiently computable feature maps. Our implementation can
be downloaded from http://www.shogun-toolbox.org.

Algorithm 2 (Dual-coordinate-ascent-based MTL training algorithm).
Generalization of the LibLinear training algorithm to multiple tasks and multiple linear kernels.

1: input: data $x_1, \ldots, x_n \in \mathcal{X}$ and labels $y_1, \ldots, y_n \in \{-1, 1\}$ associated with tasks
$\tau(1), \ldots, \tau(n) \in \{1, \ldots, T\}$; efficiently computable feature maps $\phi_1, \ldots, \phi_M$; task similarity
matrices $Q_1, \ldots, Q_M$; optimization precision $\varepsilon$
2: for all $i \in \{1, \ldots, n\}$ initialize $\alpha_i = 0$
3: for all $m \in \{1, \ldots, M\}$ and $t \in \{1, \ldots, T\}$, initialize $w_{mt}$ according to (4)
4: for all $m \in \{1, \ldots, M\}$, compute inverse $Q_m^{-1}$
5: initialize primal objective $o = nC$
6: while optimality conditions are not satisfied do
7: for all $i \in \{1, \ldots, n\}$
8: compute $d$ according to (24)
9: update $\alpha_i := \alpha_i + d$
10: for all $m \in \{1, \ldots, M\}$ and $t \in \{1, \ldots, T\}$, update $w_{mt}$ according to (25)
11: end for
12: store primal objective $o^{\text{old}} = o$ and compute new primal objective $o$
13: if primal objective has decreased, i.e., $o < o^{\text{old}}$
14: for all $m \in \{1, \ldots, M\}$, compute $\theta_m$ from $w_{m1}, \ldots, w_{mT}$ according to (22)
15: for all $m \in \{1, \ldots, M\}$ and $t \in \{1, \ldots, T\}$, update $w_{mt}$ according to (26)
16: end if
17: end while
18: output: $\varepsilon$-accurate optimal hypothesis $W = (w_{mt})_{1 \le m \le M,\, 1 \le t \le T}$ and kernel weights $\theta = (\theta_m)_{1 \le m \le M}$

In the more conventional family of approaches, the wrapper algorithms, an optimization
scheme on $\theta$ wraps around a conventional SVM solver (for instance, LIBSVM and
SVMLIGHT are integrated into SHOGUN) using a single multi-task kernel. Effectively,
this results in alternatingly solving for $\alpha$ and $\theta$. For the $\theta$-step, SHOGUN offers the three
choices listed above. The second, much faster approach performs interleaved optimization
and thus requires modification of the core SVM optimization algorithm. This is currently
integrated into the chunking-based SVRlight and SVMlight modules. Lastly, the
completely new optimization scheme as described in Algorithm Table 2 is implemented and
connected with the module for computing the $\theta$-step.

Note that the implementations for non-linear kernels come with the option of either
pre-computing the kernel or computing the kernel on the fly for large-scale data sets. For
truly large-scale MT-MKL, a linear or string kernel should be used. This is implemented
as an internal interface to the COFFIN module of SHOGUN (Sonnenburg and Franc, 2010).

3.3 Convergence Analysis 

In this section, we establish convergence of Algorithm 1 under mild assumptions. To this
end, we build on the existing theory of convergence of the block coordinate descent method.
Classical results usually assume that the function to be optimized is strictly convex and
continuously differentiable. This assumption is frequently violated in machine learning when,
for instance, the hinge loss is employed. We therefore base our convergence analysis on the
work of Tseng (2001) concerning the convergence of the block coordinate descent method
for nondifferentiable minimization.
The following proposition is a direct consequence of Lemma 3.1 and Theorem 4.1 in Tseng
(2001).
Proposition 4. Let $f : \mathbb{R}^{d_1 + \cdots + d_R} \to \mathbb{R} \cup \{\infty\}$ be a function. Put $d = d_1 + \cdots + d_R$.
Suppose that $f$ can be decomposed into $f(a_1, \ldots, a_R) = f_0(a_1, \ldots, a_R) + \sum_{r=1}^{R} f_r(a_r)$ for
some $f_0 : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ and $f_r : \mathbb{R}^{d_r} \to \mathbb{R} \cup \{\infty\}$, $r = 1, \ldots, R$. Initialize the block
coordinate descent method by $a^0 = (a_1^0, \ldots, a_R^0)$. Let $(r_k)_{k \in \mathbb{N}} \subset \{1, \ldots, R\}$ be a sequence of
coordinate blocks. Define the iterates $a^k = (a_1^k, \ldots, a_R^k)$, $k > 0$, by

$$a_{r_k}^{k+1} \in \operatorname*{argmin}_{a \in \mathbb{R}^{d_{r_k}}} f\big(a_1^{k+1}, \ldots, a_{r_k - 1}^{k+1}, a, a_{r_k + 1}^{k}, \ldots, a_R^{k}\big), \qquad a_r^{k+1} := a_r^{k}, \; r \ne r_k, \quad k \in \mathbb{N}_0. \tag{27}$$

Assume that

(A1) $f$ is convex and proper (i.e., $f \not\equiv \infty$)

(A2) the sublevel set $\mathcal{A}^0 := \{a \in \mathbb{R}^d : f(a) \le f(a^0)\}$ is compact and $f$ is continuous on $\mathcal{A}^0$
(assures existence of the minimizer in (27))

(A3) $\operatorname{dom}(f_0) := \{a \in \mathbb{R}^d : f_0(a) < \infty\}$ is open and $f_0$ is Gâteaux differentiable (for
instance, continuously differentiable) on $\operatorname{dom}(f_0)$
(yields regularity, i.e., any coordinate-wise minimum is a minimum of $f$)

(A4) there exists a number $T \in \mathbb{N}$ so that, for each $k \in \mathbb{N}$ and $r \in \{1, \ldots, R\}$, there is
$k' \in \{k, \ldots, k + T\}$ with $r_{k'} = r$
(ensures that each coordinate block is optimized "sufficiently often")

Then the minimizer in (27) exists and any cluster point of the sequence $(a^k)_{k \in \mathbb{N}}$ minimizes
$f$ over $\mathcal{A}^0$.
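
As a toy illustration of the block coordinate descent scheme in (27), the sketch below alternates exact minimization over two scalar blocks of a smooth, strictly convex quadratic. The objective is a hypothetical example chosen so that both block minimizers have closed forms; it is unrelated to Problem 1.

```python
from typing import Tuple


def block_coordinate_descent(a: float, b: float, sweeps: int = 60) -> Tuple[float, float]:
    """Minimize f(a, b) = (a - 1)^2 + (b + 2)^2 + a*b by exact alternating minimization."""
    for _ in range(sweeps):
        a = 1.0 - b / 2.0   # solves 2(a - 1) + b = 0 for fixed b
        b = -2.0 - a / 2.0  # solves 2(b + 2) + a = 0 for fixed a
    return a, b


a_star, b_star = block_coordinate_descent(0.0, 0.0)
```

Setting both partial derivatives to zero gives the unique minimizer (8/3, -10/3), and the alternating iterates contract towards it by a factor of 1/4 per sweep, in line with the statement that cluster points of the iterates are minimizers.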


Corollary 5. Assume that

(B1) the data is represented by $\phi_m(x_i) \in \mathbb{R}^{e_m}$, $i = 1, \ldots, n$, $e_m < \infty$, $m = 1, \ldots, M$

(B2) the loss function $l$ is convex, finite in 0, and continuous on its domain $\operatorname{dom}(l)$

(B3) the task similarity matrices $Q_1, \ldots, Q_M$ are positive definite

(B4) any iterate $\theta = (\theta_1, \ldots, \theta_M)$ traversed by Algorithm 1 has $\theta_m > 0$, $m = 1, \ldots, M$

(B5) the exact search specified in Line 5 of Algorithm 1 is performed

Then Algorithm 1 is well-defined and any cluster point of the sequence traversed by
Algorithm 1 is a minimal point of Problem 1.


Proof. The corollary is obtained by applying Proposition 4 to Problem 1, that is,

$$\inf_{W, \theta :\, \theta \in \Theta_p} \; \mathfrak{R}_\theta(W) + C \mathcal{L}(W), \tag{28}$$

where $\Theta_p = \{\theta \in \mathbb{R}^M : \theta_m > 0, \, m = 1, \ldots, M, \, \|\theta\|_p \le 1\}$ and, by (B1), $W \in \mathbb{R}^{e \times T}$ with
$e = e_1 + \cdots + e_M$. Note that (28) can be written unconstrained as

$$\inf_{W, \theta} f(W, \theta), \quad \text{where } f(W, \theta) := f_0(W, \theta) + f_1(W) + f_2(\theta), \tag{29}$$

by putting

$$f_0(W, \theta) := \frac{1}{2} \sum_{m=1}^{M} \frac{\|W_m\|^2_{Q_m}}{\theta_m} + I_{\{\theta > 0\}}(\theta)$$

as well as

$$f_1(W) := C \sum_{i=1}^{n} l\Big( y_i \sum_{m=1}^{M} \langle w_{m\tau(i)}, \phi_m(x_i) \rangle \Big), \qquad f_2(\theta) := I_{\{\|\theta\|_p \le 1\}}(\theta), \tag{30}$$

where $I$ is the indicator function, $I_S(s) = 0$ if $s \in S$ and $I_S(s) = \infty$ otherwise. Note that we
use the shorthand $\theta > 0$ for $\theta_m > 0$, $m = 1, \ldots, M$.

Assumption (B5) ensures that applying the block coordinate descent method to (29)
yields precisely the sequence of iterates traversed by Algorithm 1. Thus, in order to prove the
corollary, it suffices to validate that (29) fulfills Assumptions (A1)-(A4) in Proposition 4.

Validity of (A1) Recall that Algorithm 1 is initialized with $W^0 = 0$ and $\theta_m^0 =
\sqrt[p]{1/M}$, $m = 1, \ldots, M$, so it holds

$$f(W^0, \theta^0) = \underbrace{f_0(W^0, \theta^0)}_{= 0} + \underbrace{f_1(W^0)}_{= C n \, l(0)} + \underbrace{f_2(\theta^0)}_{= 0} = C n \, l(0) < \infty, \tag{31}$$

hence $f \not\equiv \infty$, so $f$ is proper. Furthermore, $\operatorname{dom}(f_0) = \{(W, \theta) : \theta > 0\}$ is convex, and $f_0$
is convex on $\operatorname{dom}(f_0)$, so $f_0$ is a convex function. By (B2), the loss function $l$ is convex, so
$f_1$ is a convex function. The domain $\operatorname{dom}(f_2) = \{\theta : \|\theta\|_p \le 1\}$ is convex, and $f_2 = 0$ on
its domain, so $f_2$ is a convex function. Thus the sum $f = f_0 + f_1 + f_2$ is a convex function,
which shows (A1).

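Validity of (A2) Let $(W, \theta) \in \mathcal{A}^0 := \{(W, \theta) : f(W, \theta) \le f(W^0, \theta^0)\}$. We have
$f_0, f_1, f_2 \ge 0$, so, for all $m = 1, \ldots, M$,

$$\frac{\|W_m\|^2_{Q_m}}{2 \theta_m} \le f_0(W, \theta) \le f_0(W, \theta) + \underbrace{f_1(W)}_{\ge 0} + \underbrace{f_2(\theta)}_{\ge 0} = f(W, \theta) \le f(W^0, \theta^0) \overset{\text{by (31)}}{=} C n \, l(0), \tag{32}$$

which implies $\|W_m\|^2_{Q_m} \le 2 \theta_m C n \, l(0)$. Similarly, because $f_0, f_1 \ge 0$, we have $f_2(\theta) \le
C n \, l(0) < \infty$, which, by (30), implies $\|\theta\|_p \le 1$ and thus $\theta_m \le 1$, $m = 1, \ldots, M$. Hence,
by (32), $\|W_m\|^2_{Q_m} \le 2 C n \, l(0)$, $m = 1, \ldots, M$. Because $Q_1, \ldots, Q_M$ are positive definite,
$\nu := \min_{m = 1, \ldots, M} \operatorname{tr}(Q_m) > 0$. Thus, for any $m = 1, \ldots, M$,

$$\|W_m\|^2 = \operatorname{tr}(W_m^\top W_m) = \operatorname{tr}(W_m^\top W_m) \operatorname{tr}(Q_m) / \operatorname{tr}(Q_m) \le \operatorname{tr}(W_m^\top W_m Q_m) / \operatorname{tr}(Q_m)$$

$$\le \nu^{-1} \operatorname{tr}(W_m^\top W_m Q_m) = \nu^{-1} \operatorname{tr}(W_m Q_m W_m^\top) = \nu^{-1} \|W_m\|^2_{Q_m} \le 2 \nu^{-1} C n \, l(0).$$

Thus

$$\|(W, \theta)\|^2 = \|W\|^2 + \|\theta\|^2 \le 2 \nu^{-1} C M n \, l(0) + M < \infty.$$

Thus $\sup_{(W, \theta) \in \mathcal{A}^0} \|(W, \theta)\| < \infty$, which shows that $\mathcal{A}^0$ is bounded. Furthermore, $\mathcal{A}^0 \subset
\operatorname{dom}(f) = \operatorname{dom}(f_0) \cap \operatorname{dom}(f_1) \cap \operatorname{dom}(f_2)$ and $f_0, f_1, f_2$ are continuous on their respective
domains. Thus $f$ is continuous on $\operatorname{dom}(f)$ and thus also on its subset $\mathcal{A}^0$. It holds $\mathcal{A}^0 =
f^{-1}(\,]-\infty, f(W^0, \theta^0)]\,)$, i.e., $\mathcal{A}^0$ is the preimage of a closed set under a continuous function;
thus $\mathcal{A}^0$ is closed. Any closed and bounded subset of $\mathbb{R}^d$ is compact. Thus $\mathcal{A}^0$ is compact,
which was to be shown.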

Validity of (A3) and (A4) Clearly, $\operatorname{dom}(f_0) = \{(W, \theta) : \theta > 0\}$ is open and $f_0$
is continuously differentiable on $\operatorname{dom}(f_0)$. Thus it is Gâteaux differentiable on $\operatorname{dom}(f_0)$.
Finally, assumption (A4) is trivially fulfilled as Algorithm 1 employs a simple alternating
rule for traversing the blocks of coordinates.

In summary, Proposition 4 can thus be applied to Problem 1, which yields the claim of
the corollary. □

Remark 6. In this paper, we experiment on finite-dimensional string kernels, so
Assumption (B1) is naturally fulfilled. Note that, more generally, $\phi_m(x_i) \in \mathbb{R}^{e_m}$ for all $i = 1, \ldots, n$,
$m = 1, \ldots, M$, can also be enforced for infinite-dimensional kernels, as, for any finite
sample $x_1, \ldots, x_n$, there exists an $n$-dimensional feature representation of the sample that can be
explicitly computed in terms of the empirical kernel map (Schölkopf et al., 1999).
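
One way to see that a finite sample always admits an n-dimensional representation is to factorize its kernel matrix. The sketch below uses a Cholesky factorization, which is an assumption made for simplicity: it requires a strictly positive definite kernel matrix and is not necessarily the construction used in the cited reference. The Gaussian toy kernel is likewise only an example.

```python
import math
from typing import List

Matrix = List[List[float]]


def cholesky_features(K: Matrix) -> Matrix:
    """Return L with K = L L^T; row i of L is an n-dimensional feature vector of x_i."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


# Gaussian kernel on three distinct points of the real line (hypothetical example).
points = [0.0, 1.0, 2.0]
K = [[math.exp(-0.5 * (a - b) ** 2) for b in points] for a in points]
F = cholesky_features(K)
gram = [[sum(u * v for u, v in zip(F[i], F[j])) for j in range(3)] for i in range(3)]
```

The inner products of the rows of `F` reproduce the kernel matrix, so a solver that operates on explicit feature vectors can be applied to the sample even though the underlying feature space is infinite-dimensional.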


4 Applications 

We demonstrate the performance of different facets of our framework with several
experiments ranging from well-controlled toy data to a large-scale experiment on a highly relevant
genomic data set, where we combine data from a diverse set of organisms using multitask
learning. We start with a review of our prior experimental work based on algorithms that
are closely related to the ones described in this work.

4.1 Previous work 

The theoretical framework presented in this paper is a generalization of the methods
successfully used in our previous work. Special cases of the above framework were investigated in
the context of genomic signal prediction (Schweikert et al., 2008; Widmer et al., 2010a),
sequence segmentation with structured output learning (Görnitz et al., 2011), computational
immunology (Widmer et al., 2010b,c; Toussaint et al., 2010) and problems from biological
imaging (Lou et al., 2012; Widmer et al., 2014; Lou et al., 2014). Further, we have
investigated an efficient algorithm to solve special cases of our method on a large number of
machine learning data sets in Widmer et al. (2012). We have previously summarized some
of our earlier work in (Widmer and Rätsch, 2012; Widmer et al., 2013a,b). An example of
earlier results from Widmer et al. (2010a) is given in Figure 2. It illustrates an
application of the MTL algorithm to a case where we have multiple datasets associated with 15
organisms. Their evolutionary relationship is assumed to be known and is used for
informing task relatedness in the algorithm that is described in Section 2.5.3 and Widmer et al.
(2010a). This experiment exemplifies the successful application of MTL in
computational biology for the joint analysis of multiple related problems.

In the two experiments described in the following, we go beyond our
previous work by investigating our framework in its full generality.
Figure 2: Results from multitask learning on several organisms. Shown is a subset of the
results reported in Widmer et al. (2010a), where we combined splice site data from 15
organisms. We compared a multitask learning approach to the baseline methods individual (each
task is learned independently) and union (all data is simply pooled). As for multitask
learning, we used only a single, fixed similarity measure, which we inferred from the evolutionary
history of the organisms at hand. These and other results in Schweikert et al. (2008);
Widmer et al. (2010a); Görnitz et al. (2011); Widmer et al. (2010b,c); Toussaint et al. (2010);
Lou et al. (2012); Widmer et al. (2014); Lou et al. (2014) illustrate the power of multitask
learning for related tasks in computational biology.


4.2 Experiments on Biologically Motivated Controlled Data 

In this section, we evaluate Hierarchical MT-MKL as described in Section 2.6.2 on an
artificial data set motivated by biological evolution. At the core of this example is the
binary classification of examples generated from two 100-dimensional isotropic Gaussian
distributions with a standard deviation of $\sigma = 20$. The difference of the mean vectors
$\mu_{\text{pos}}$ and $\mu_{\text{neg}}$ is captured by a difference vector $\mu_d$. We set $\mu_{\text{pos}} = 0.5 \mu_d$ and $\mu_{\text{neg}} =
-0.5 \mu_d$. To turn this into an MTL setting, we start with a single $\mu_d = (1, \ldots, 1)^\top$ and
apply mutations to it. These mutations correspond to flipping the sign of $m$ dimensions
in $\mu_d$, where $m = 5$. Inspired by biological evolution, mutations are then applied in a
hierarchical fashion according to a binary tree of depth 4 (corresponding to $2^5 = 32$ leaves).
Starting at the root node, we apply subsequent mutations to the $\mu_d$ at the inner nodes of the
hierarchy and work down the tree until each leaf carries its own $\mu_d$. We sample 10 training
points and 1,000 test points for each class and for each of the 32 tasks. The similarity
between the $\mu_d$ at the leaves is computed by taking the dot product between all pairs and
is shown in Figure 3(a). Clearly, this information is valuable when deciding which tasks
(corresponding to leaves in this context) should be coupled and will be referred to as the
true task similarity matrix in the following. We use Hierarchical MT-MKL as described in
Section 2.6.2 by creating adjacency matrices for each inner node and subsequently learning
a weighting using MT-MKL.
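
The data-generating scheme can be sketched as follows. The details are assumptions made for illustration: each child receives an independently mutated copy of its parent's difference vector, five rounds of branching are used to obtain 32 leaves, and the random seeds are arbitrary.

```python
import random
from typing import List

Vector = List[float]


def mutate(mu: Vector, flips: int, rng: random.Random) -> Vector:
    """Return a copy of mu with the sign of `flips` distinct coordinates flipped."""
    child = list(mu)
    for k in rng.sample(range(len(child)), flips):
        child[k] = -child[k]
    return child


def leaf_vectors(dim: int = 100, levels: int = 5, flips: int = 5,
                 seed: int = 0) -> List[Vector]:
    """Mutate down a full binary tree; `levels` rounds of branching give 2**levels leaves."""
    rng = random.Random(seed)
    nodes = [[1.0] * dim]  # root: mu_d = (1, ..., 1)
    for _ in range(levels):
        nodes = [mutate(parent, flips, rng) for parent in nodes for _child in range(2)]
    return nodes


def sample_class(mu_d: Vector, sign: float, count: int, rng: random.Random,
                 std: float = 20.0) -> List[Vector]:
    """Draw `count` points from an isotropic Gaussian with mean sign * 0.5 * mu_d."""
    return [[rng.gauss(sign * 0.5 * m, std) for m in mu_d] for _ in range(count)]


leaves = leaf_vectors()
similarity = [[sum(a * b for a, b in zip(u, v)) for v in leaves] for u in leaves]
train_pos = sample_class(leaves[0], +1.0, 10, random.Random(1))
```

Sibling leaves differ in at most ten coordinates, so their similarity stays close to the maximum of 100, whereas leaves that separate near the root accumulate more sign flips and become less similar.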
We compare MT-MKL with p = 1, 2, 3 to the following baseline methods: Union, which
combines data from all tasks into a single group; Individual, which treats each task separately;
and Vanilla MTL, which uses MTL with the same weight for all matrices. We report the
mean (averaged over tasks) ROC curve for each of the above methods in Figure 3(b).



(a) True task similarity matrix (see main text)
(b) Performance of Hierarchical MT-MKL vs. baseline methods (receiver operating characteristic).
Legend: MT-MKL p-norm = 1.00 (AUC = 0.7939); MT-MKL p-norm = 2.00 (AUC = 0.8203);
MT-MKL p-norm = 3.00 (AUC = 0.8293); Vanilla (AUC = 0.7660); Individual (AUC = 0.6841);
Union (AUC = 0.7876)


Figure 3: Illustration of Hierarchical MT-MKL on an artificial dataset: In 3(a), we show 
the similarity matrix between all 32 tasks as generated by a biologically inspired scheme, 
where generating parameters are mutated according to a given tree structure (see main text 
for details). Comparison of MT-MKL to baselines Vanilla MTL, Union, Individual is shown 
in 3(b), where ROC curves are averaged over the 32 tasks for each method. MT-MKL with 
p = 2 and p = 3 perform best for this task. 

From Figure 3(b) we observe that the baseline Individual performs worst by a large
margin, suggesting that combining information from several tasks is clearly beneficial for
this data set. Next, we observe that a simple way of combining tasks (i.e., Union) already
considerably improves performance. Furthermore, we observe that learning weights of
hierarchically inferred task groupings in fact improves performance compared to Vanilla for
non-sparse MT-MKL (i.e., p = 2, 3). Of all methods, non-sparse MT-MKL is most accurate
for all recall values.


4.3 Genomic Signal - Transcription Start Site (TSS) Prediction 

In this experiment, we consider an application from genome sequence analysis. The goal
is to accurately identify the genomic signal called transcription start site (TSS) based on
the surrounding genomic sequence. The TSS is the genomic location at which transcription, the
process whereby RNA copies are made from regions of the genome, is initiated.
We have obtained genomic data from ENSEMBL (Hubbard et al., 2002),
a community resource that brings together genomic sequences and their annotations. From
this, we compiled a data set for nine organisms (E. caballus, C. briggsae, M. musculus,
C. elegans, D. rerio, D. simulans, V. vinifera, A. thaliana, and H. sapiens), where we
took annotated instances of transcription starts as positive examples and sequences around
randomly selected positions in the genome as background. We use our framework to jointly
learn models for different organisms, treating different organisms as different tasks.


Task similarity To generate an initial task similarity matrix, we extracted the
phylogenetic similarity between different organisms based on their genomic sequences. In particular,
we computed the Hamming distance between well-conserved 16S ribosomal RNA regions
(i.e., stretches of genomic sequence with low degree of change during evolution) between
different classes of organisms (Isenbarger et al., 2008). Subsequently, we either used this
similarity directly in our multitask learning algorithms (MTL) or attempted to refine it
further using MT-MKL. To create a set of task similarities to be weighted by MT-MKL,
we applied exponential transformations to the base task similarity at different length scales
($\sigma \in \{0.1, 7.55, 15.0\}$; see Section 2.6.3).


Experimental Setup and Results We have collected 4,000 TSS signal sequences for
each organism, which include 1,000 positively and 3,000 negatively labeled sequences for training
and testing. Both ends of the TSS signal label sequence consist of 1,200 flanking nucleotides.
On this data set, we evaluated the two baseline methods as well as MTL and MT-MKL. In the
evaluation scheme used, we split the data into a training set, a validation set, and a test set for each
organism. We use ten splits. The best regularization constant is selected on the validation
split for each organism. In Figure 4 we report the average area under the ROC curve (AUC)
over the ten test sets, for each of which the best regularization parameter was chosen on a
separate evaluation set.
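
For reference, the area under the ROC curve can be computed directly as the fraction of positive-negative pairs that a classifier ranks correctly, with ties counted as one half. The sketch below is a plain quadratic-time version with hypothetical scores; it is meant as an illustration of the metric, not of the evaluation code used for the reported results.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs for two negative and two positive examples.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [-1, -1, 1, 1])
```

In the toy example, three of the four positive-negative pairs are ordered correctly, which gives an AUC of 0.75; averaging such values over the test splits yields per-organism numbers of the kind reported in Figure 4.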

From Figure 4, we observe that for four out of nine organisms the single-task SVM
(individual) outperforms the SVM that is trained on training instances from all organisms pooled
(union). From this we conclude that the learning tasks are substantially dissimilar. On
the other hand, we observe that for some organisms (M. musculus, D. rerio, V. vinifera,
and H. sapiens), there is an improvement by union over individual, which indicates that
these tasks are more similar than the remaining tasks. This is an indicator that MTL may
be beneficial for this data. See also the discussion in Widmer et al. (2013b). Indeed, MTL
improves (at least marginally) over Union and Individual in seven and five out of nine
organisms, respectively. However, it is surpassed by Individual for four organisms (A. thaliana,
C. briggsae, C. elegans, and D. simulans). While the overall performance of MTL is
slightly better than that of Union and Individual, the differences are minor, which we attribute
to a possibly suboptimally chosen task similarity matrix. (In fact, practically speaking, we
find that selecting a good task similarity matrix is the most difficult aspect of multitask
learning.)

The proposed MT-MKL, on the other hand, improves over individual on eight out of
nine organisms (and is not much worse on the ninth task). It improves over MTL by close
to 5% AUC for some organisms. On average, it performs about 2.5% better than any other
considered algorithm. MT-MKL achieves this by refining task similarities and thus is able
to improve classification performance.
Figure 4: Average AUC achieved by the proposed MT-MKL as well as the baseline methods,
on the gene-start dataset (TSS). MT-MKL improves the mean accuracy considerably. In
addition, the accuracy of MT-MKL is best in eight out of the nine organisms.


In summary, we are able to demonstrate that multitask learning and MT-MKL strategies
are beneficial when combining information from several organisms, and we believe that this
setting has potential for tackling future prediction problems in computational biology, and
potentially also in other application domains of multitask and multiple kernel learning such
as computer vision (Lou et al., 2012; Kloft et al., 2009b; Binder et al., 2012; Widmer et al.,
2014; Lou et al., 2014) and computer security (Kloft et al., 2008a; Kloft and Laskov, 2012;
Görnitz et al., 2013).

5 Conclusion 

We presented a general regularization-based framework for multi-task learning (MTL),
in which the similarity between tasks can be learned or refined using $\ell_p$-norm multiple
kernel learning (MKL). Based on this very general formulation (including a general loss
function), we derived the corresponding dual formulation using Fenchel duality applied to
Hermitian matrices. We showed that numerous established MTL methods can be derived
as special cases from both the primal and the dual of our formulation. Furthermore, we derived
an efficient dual coordinate descent optimization strategy for the hinge-loss variant of our
formulation and provided convergence bounds for our algorithm. Combined with our efficient
integration into the SHOGUN toolbox using the COFFIN feature hashing framework, the
approach can be used to process a large number of training points. The solver can also be
used to solve the vanilla $\ell_p$-norm MKL problem in the primal very efficiently, and can potentially
be extended to more recent MKL approaches (Cortes et al., 2013). Our solvers, including all
discussed special cases, are made available as open-source software as part of the SHOGUN
machine learning toolbox.

In the experimental part of this paper, we analyzed our algorithm in terms of predictive 
performance and ability to reconstruct task relationships on toy data, as well as on problems 
from computational biology. This includes a study at the intersection of multitask 
learning and genomics, where we analyzed nine organisms jointly. In summary, we were able 
to demonstrate that the proposed learning algorithm can outperform baseline methods by 
combining information from several organisms. 

In future work, we would like to investigate the theoretical foundations of the approach (a good 
starting point to this end is the work by Kloft and Blanchard (2011, 2012)), extensions 
to structured output prediction (Görnitz et al., 2011), and applications of the method to further 
problems from computational biology and the biomedical domain. These settings have great 
potential; for instance, a Bayesian adaptation of our approach was very recently shown to be 
the leading model in an international comparison of 44 drug prediction methods for breast 
cancer (Costello et al., 2014). 

A Fenchel Duality in Hilbert Spaces 

In this section, we review Fenchel duality theory for convex functions over real Hilbert 
spaces. The results presented in this appendix are taken from Chapters 15 and 19 in 
Bauschke and Combettes (2011). For complementary reading, we refer to the excellent 
introduction of Bauschke and Lucet (2012). Fenchel duality for machine learning has also 
been discussed in Rifkin and Lippert (2007) assuming Euclidean spaces. We start the 
presentation with the definition of the convex conjugate function. 

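Definition 7 (Convex conjugate). Let $\mathcal{H}$ be a real Hilbert space and let 
$g : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ be a convex function. We assume in the whole 
section that $g$ is proper, that is, $\{w \in \mathcal{H} \mid g(w) \in \mathbb{R}\} \neq \emptyset$. 
Then the convex conjugate $g^* : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ is defined by 

$$ g^*(w) = \sup_{v \in \mathcal{H}} \; \langle v, w \rangle - g(v). $$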
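
As a small worked instance of Definition 7, consider the scalar hinge loss on $\mathcal{H} = \mathbb{R}$. The symbol $\ell_{\mathrm{hinge}}$ and the case distinction below are notation chosen for this illustration only.

```latex
\begin{align*}
\ell_{\mathrm{hinge}}(a) &= \max(0,\, 1 - a), \\
\ell_{\mathrm{hinge}}^*(b) &= \sup_{a \in \mathbb{R}} \; ab - \max(0,\, 1 - a)
  = \begin{cases} b, & -1 \le b \le 0, \\ \infty, & \text{otherwise.} \end{cases}
\end{align*}
```

For $b > 0$ the objective grows without bound as $a \to \infty$, and for $b < -1$ it grows without bound as $a \to -\infty$. For $-1 \le b \le 0$ the objective is non-decreasing on $a \le 1$ and non-increasing on $a \ge 1$, so the supremum is attained at $a = 1$ with value $b$.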


As the convex conjugate is a supremum over affine functions, it is convex and lower semi-continuous. 
We have the beautiful duality 

$$ g^{**} = g \quad \Longleftrightarrow \quad g \text{ is convex and lower semi-continuous.} $$
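
The biconjugate identity can be checked numerically in one dimension. The following is only a sketch under simplifying assumptions: the supremum is replaced by a maximum over a finite grid on $[-3, 3]$, the same grid is used for points and slopes, and the helper `conjugate` as well as the two test functions are hypothetical choices made for this illustration.

```python
import numpy as np

def conjugate(grid, values):
    """Discrete conjugate f*(w) = max_v (v*w - f(v)); v and w share the same grid."""
    # Entry [i, j] holds w_i * v_j - f(v_j); maximise over v (axis 1).
    return np.max(np.outer(grid, grid) - values[None, :], axis=1)

grid = np.linspace(-3.0, 3.0, 601)
mid = len(grid) // 2                      # grid point closest to v = 0

# Convex case: g(v) = v^2 / 2 is its own conjugate, hence also its own biconjugate.
g_convex = 0.5 * grid**2
g_star = conjugate(grid, g_convex)
g_bistar = conjugate(grid, g_star)
gap_convex = np.max(np.abs(g_bistar - g_convex))

# Non-convex case: double well f(v) = (v^2 - 1)^2 with f(0) = 1.
f_nonconvex = (grid**2 - 1.0) ** 2
f_bistar = conjugate(grid, conjugate(grid, f_nonconvex))

print("convex:     max |g** - g| =", gap_convex)
print("non-convex: f(0) =", f_nonconvex[mid], " f**(0) =", f_bistar[mid])
```

In the convex case the biconjugate reproduces the function up to rounding. In the non-convex case the biconjugate lies strictly below the function between the two wells, which is consistent with the equivalence stated above.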


This indicates that the “right domain” to study conjugate functions is the set of convex, 
lower semi-continuous, and proper (“ccp”) functions. In order to present the main result of 
this appendix, we need the following standard result from operator theory. 

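Proposition 8 (Definition and uniqueness of the adjoint map). Let $\mathcal{H}, \tilde{\mathcal{H}}$ be real Hilbert spaces 
and let $A : \mathcal{H} \to \tilde{\mathcal{H}}$ be a continuous linear map. Then there exists a unique continuous linear 
map $A^* : \tilde{\mathcal{H}} \to \mathcal{H}$ with $\langle A(w), \alpha \rangle = \langle w, A^*\alpha \rangle$, which is called the adjoint map of $A$. 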


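For example, in the Euclidean case, we have $\mathcal{H} = \mathbb{R}^m$, $\tilde{\mathcal{H}} = \mathbb{R}^n$, and $A \in \mathbb{R}^{n \times m}$, so that 
the adjoint map is simply the transpose $A^* = A^\top \in \mathbb{R}^{m \times n}$. We now present the main result of this appendix, 
which is known as Fenchel's duality theorem: 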

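Theorem 9 (Fenchel's duality theorem). Let $\mathcal{H}, \tilde{\mathcal{H}}$ be real Hilbert spaces and let 
$g : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ and $h : \tilde{\mathcal{H}} \to \mathbb{R} \cup \{\infty\}$ be ccp. 
Let $A : \mathcal{H} \to \tilde{\mathcal{H}}$ be a continuous linear map. Then the primal and dual problems, 

$$ p^* = \inf_{w \in \mathcal{H}} \; g(w) + h(A(w)) $$

$$ d^* = \sup_{\alpha \in \tilde{\mathcal{H}}} \; -g^*(A^*(\alpha)) - h^*(-\alpha), $$

satisfy weak duality (i.e., $d^* \le p^*$). Assume, furthermore, that $A(\mathrm{dom}(g)) \cap \mathrm{cont}(h) \neq \emptyset$, 
where $\mathrm{dom}(g) := \{w \in \mathcal{H} : g(w) < \infty\}$ and $\mathrm{cont}(h) := \{\alpha \in \tilde{\mathcal{H}} : h \text{ continuous in } \alpha\}$. Then 
we even have strong duality (i.e., $d^* = p^*$) and any optimal solution $(w^*, \alpha^*)$ satisfies 

$$ w^* = \nabla g^*(A^*(\alpha^*)), $$

if $g^* \circ A^*$ is (Gâteaux) differentiable in $\alpha^*$. 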
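
To make the theorem concrete, the following sketch checks it numerically on a small ridge-regression-type instance in the Euclidean case. The choices $g(w) = \tfrac{1}{2}\|w\|^2$ and $h(z) = \tfrac{1}{2}\|z - y\|^2$, the random data, and all variable names are assumptions made for this illustration; they are not the formulation studied in the main part of the paper. For these choices, $g^*(u) = \tfrac{1}{2}\|u\|^2$ and $h^*(\beta) = \tfrac{1}{2}\|\beta\|^2 + \langle \beta, y \rangle$, and $h$ is continuous everywhere, so the condition for strong duality holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                        # A maps R^m (primal) to R^n (dual)
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

def primal(w):
    # g(w) + h(A w) with g(w) = ||w||^2 / 2 and h(z) = ||z - y||^2 / 2
    return 0.5 * w @ w + 0.5 * np.sum((A @ w - y) ** 2)

def dual(alpha):
    # -g*(A^T alpha) - h*(-alpha), using g*(u) = ||u||^2 / 2
    # and h*(b) = ||b||^2 / 2 + <b, y>
    u = A.T @ alpha
    return -0.5 * u @ u - 0.5 * alpha @ alpha + alpha @ y

# Both problems are quadratic, so the optima solve linear systems.
w_star = np.linalg.solve(A.T @ A + np.eye(m), A.T @ y)
alpha_star = np.linalg.solve(A @ A.T + np.eye(n), y)

p_star, d_star = primal(w_star), dual(alpha_star)
print("p* =", p_star, " d* =", d_star)
# grad g*(u) = u, so the theorem predicts w* = A^T alpha*.
print("max |w* - A^T alpha*| =", np.max(np.abs(w_star - A.T @ alpha_star)))
```

The two optimal values agree up to rounding, and the primal solution is recovered from the dual solution through $w^* = \nabla g^*(A^*(\alpha^*)) = A^\top \alpha^*$.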


When applying Fenchel duality theory, we frequently need to compute the convex conjugates 
of certain functions. To this end, the following computation rules are helpful. 

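Proposition 10. The following computation rules hold for the convex conjugate: 

1. Let $g : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ be a proper convex function on a real Hilbert space $\mathcal{H}$. Then, 
for any $\lambda > 0$ and $w \in \mathcal{H}$, we have $(\lambda g)^*(w) = \lambda g^*(w/\lambda)$. 

2. Furthermore, assume that $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$ and $g(w) = g_1(w_1) + g_2(w_2)$, where 
$g_1 : \mathcal{H}_1 \to \mathbb{R} \cup \{\infty\}$ and $g_2 : \mathcal{H}_2 \to \mathbb{R} \cup \{\infty\}$ are proper convex functions on Hilbert 
spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, respectively. Then, for any $w = (w_1, w_2) \in \mathcal{H}_1 \oplus \mathcal{H}_2$, we have 
$g^*(w) = g_1^*(w_1) + g_2^*(w_2)$. 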
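
As an illustration of both rules, consider scaled squared-norm regularizers. The functions and the weights $\lambda, \lambda_1, \lambda_2 > 0$ below are chosen for this example only; the first line is the standard fact that the squared norm is self-conjugate.

```latex
\begin{align*}
g(w) = \tfrac{1}{2}\|w\|^2 \;&\Rightarrow\; g^*(w) = \tfrac{1}{2}\|w\|^2, \\
(\lambda g)^*(w) &= \lambda\, g^*(w/\lambda)
  = \frac{\lambda}{2}\,\Big\|\frac{w}{\lambda}\Big\|^2
  = \frac{1}{2\lambda}\,\|w\|^2, \\
\tilde{g}(w_1, w_2) = \tfrac{\lambda_1}{2}\|w_1\|^2 + \tfrac{\lambda_2}{2}\|w_2\|^2
  \;&\Rightarrow\;
  \tilde{g}^*(w_1, w_2) = \frac{1}{2\lambda_1}\|w_1\|^2 + \frac{1}{2\lambda_2}\|w_2\|^2 .
\end{align*}
```

The second line applies rule 1, and the third line applies rule 2 block by block and then rule 1 within each block.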

B Conjugate of the Logistic Loss 

The following lemma gives the convex conjugate of the logistic loss. 


Lemma 11 (Conjugate of Logistic Loss). The conjugate of the logistic loss, defined as 
$l(a) = \log(1 + \exp(-a))$, is given by 

$$ l^*(a) = -a \log(-a) + (1 + a) \log(1 + a). $$


Proof. By definition of the conjugate, 

$$ l^*(a) = \sup_{b \in \mathbb{R}} \; \underbrace{ab - \log(1 + \exp(-b))}_{=: \psi(b)} . $$

Note that the problem is unbounded for $a < -1$ and $a > 0$. For $a \in \,]-1, 0[$, the supremum 
is attained when $\psi'(b) = 0$, which translates into $b = -\log(-a/(1 + a))$ and $1 + \exp(-b) = 
1/(1 + a)$. Thus 

$$ l^*(a) = -a \log(-a/(1 + a)) - \log(1/(1 + a)) = -a \log(-a) + (1 + a) \log(1 + a), $$

which was to be shown. 

□ 
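
The closed form of Lemma 11 can be sanity-checked numerically. The sketch below compares it with a direct maximization of $ab - l(b)$ over $b$ for a few values $a \in \,]-1, 0[$. The function names, the search interval $[-50, 50]$, and the use of ternary search are choices made for this illustration; the search relies only on the concavity of the objective in $b$.

```python
import math

def logistic_loss(b):
    # log(1 + exp(-b)), rewritten so that exp never overflows
    return max(-b, 0.0) + math.log1p(math.exp(-abs(b)))

def conjugate_closed_form(a):
    # Formula from Lemma 11, valid for -1 < a < 0
    return -a * math.log(-a) + (1.0 + a) * math.log1p(a)

def conjugate_numeric(a, lo=-50.0, hi=50.0, iters=200):
    # Ternary search for the maximum of the concave function psi(b) = a*b - l(b)
    def psi(b):
        return a * b - logistic_loss(b)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if psi(m1) < psi(m2):
            lo = m1
        else:
            hi = m2
    return psi(0.5 * (lo + hi))

for a in (-0.9, -0.5, -0.1):
    print(a, conjugate_closed_form(a), conjugate_numeric(a))
```

For each tested value the two numbers agree up to numerical precision. At $a = -1/2$ the maximizer is $b = 0$, and both expressions reduce to $-\log 2$.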


References 

A. Agarwal, H. Daumé III, and S. Gerber. Learning Multiple Tasks using Manifold Regularization. 
In Advances in Neural Information Processing Systems 23, 2010. 

W.-K. Ahn and W. F. Brewer. Psychological studies of explanation-based learning. In 
Investigating explanation-based learning, pages 295-316. Springer, 1993. 

R. K. Ando and T. Zhang. A Framework for Learning Predictive Structures from Multiple 
Tasks and Unlabeled Data. Journal of Machine Learning Research, 6(6):1817-1853, 2005. 
ISSN 15324435. 

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. NIPS 2007, 2007. 

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine 
Learning, 73(3):243-272, 2008a. ISSN 0885-6125. 

A. Argyriou, C. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for 
multi-task structure learning. Advances in Neural Information Processing Systems, 20: 
25-32, 2008b. 

A. Argyriou, C. Micchelli, and M. Pontil. When is there a representer theorem? Vector 
versus matrix regularizers. The Journal of Machine Learning Research, 10:2507-2529, 
2009. ISSN 1532-4435. 

H. Bauschke and Y. Lucet. What is a Fenchel conjugate? Notices of the AMS, 59:44-46, 
2012. 

H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in 
Hilbert spaces. CMS Books in mathematics. Springer, New York, 2011. ISBN 1441994661. 


J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 
12(1):149-198, Feb. 2000. ISSN 1076-9757. 

A. Binder, S. Nakajima, M. Kloft, C. Müller, W. Samek, U. Brefeld, K.-R. Müller, and 
M. Kawanabe. Insights from classifying visual concepts with multiple kernel learning. 
PLoS ONE, 7(8):e38897, 2012. 

R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML, 
pages 41-48. Morgan Kaufmann, 1993. ISBN 1-55860-307-7. 

R. Caruana. Multitask Learning. Machine Learning, 28(1):41-75, 1997. ISSN 08856125. 
doi: 10.1023/A:1007379606734. 

J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from 
multiple tasks. In Proceedings of the 26th Annual International Conference on Machine 
Learning - ICML ’09, pages 1-8, New York, New York, USA, June 2009. ACM Press. 
ISBN 9781605585161. doi: 10.1145/1553374.1553392. 

C. Cortes, M. Kloft, and M. Mohri. Learning kernels using local Rademacher complexity. 
In Advances in Neural Information Processing Systems, pages 2760-2768, 2013. 

J. C. Costello, L. M. Heiser, E. Georgii, M. Gönen, M. P. Menden, N. J. Wang, M. Bansal, 
P. Hintsanen, S. A. Khan, J.-P. Mpindi, et al. A community effort to assess and improve 
drug sensitivity prediction algorithms. Nature Biotechnology, 2014. doi: 10.1038/nbt.2877, 
to appear. 

H. Daumé III. Frustratingly easy domain adaptation. In Annual meeting-association for 
computational linguistics, volume 45, page 256, 2007. 

T. Evgeniou and M. Pontil. Regularized multi-task learning. In International Conference 
on Knowledge Discovery and Data Mining, pages 109-117, 2004. 

T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. 
Journal of Machine Learning Research, 6(1):615-637, 2005. ISSN 1532-4435. 

M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine 
Learning Research, 12:2211-2268, 2011. 

N. Görnitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and G. Rätsch. Hierarchical 
Multitask Structured Output Learning for Large-scale Sequence Segmentation. In 
Advances in Neural Information Processing Systems 24, 2011. 

N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. 
Journal of Artificial Intelligence Research, 46:1-15, 2013. 

T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, 
V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, 
A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, 
S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, 
J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik, and M. Clamp. The Ensembl genome database 
project. Nucleic Acids Research, 30(1):38-41, 2002. doi: 10.1093/nar/30.1.38. 

T. Isenbarger, C. Carr, S. Johnson, M. Finney, G. Church, W. Gilbert, M. Zuber, and G. Ruvkun. 
The most conserved genome segments for life detection on earth and other planets. 
Orig Life Evol Biosph, ASTROBIOLOGY, 2008. doi: 10.1007/s11084-008-9148-z. 

L. Jacob and J. Vert. Efficient peptide-MHC-I binding prediction for alleles with few known 
binders. Bioinformatics (Oxford, England), 24(3):358-66, Feb. 2008. ISSN 1367-4811. doi: 
10.1093/bioinformatics/btm611. 

L. Jacob, F. Bach, and J. Vert. Clustered multi-task learning: A convex formulation. arXiv 
preprint arXiv:0809.2085, 2008. 

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, 
and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 
169-184, Cambridge, MA, 1999. MIT Press. 

M. Kloft and G. Blanchard. The local Rademacher complexity of ℓp-norm multiple kernel 
learning. In Advances in Neural Information Processing Systems, pages 2438-2446, 2011. 

M. Kloft and G. Blanchard. On the convergence rate of ℓp-norm multiple kernel learning. 
The Journal of Machine Learning Research, 13(1):2465-2502, 2012. 

M. Kloft and P. Laskov. Security analysis of online centroid anomaly detection. The Journal 
of Machine Learning Research, 13(1):3681-3724, 2012. 

M. Kloft, U. Brefeld, P. Düssel, C. Gehl, and P. Laskov. Automatic feature selection for 
anomaly detection. In Proceedings of the 1st ACM workshop on Workshop on AISec, 
pages 71-76. ACM, 2008a. 

M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel learning. 
In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Kernels, dec 
2008b. 

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and 
accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing 
Systems 22, pages 997-1005. MIT Press, 2009a. 

M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In Machine 
Learning and Knowledge Discovery in Databases, pages 692-704. Springer, 2009b. 

M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. Journal 
of Machine Learning Research, 12:953-997, Mar 2011. 

J. Liu, S. Ji, and J. Ye. Multi-Task Feature Learning Via Efficient L2,1-Norm Minimization. 
In UAI 2009, UAI ’09, pages 339-348. AUAI Press, 2009. ISBN 9780974903958. 

X. Lou, C. Widmer, M. Kang, G. Rätsch, and A. Hadjantonakis. Structured Domain Adaptation 
Across Imaging Modality: How 2D Data Helps 3D Inference. In NIPS Machine 
Learning in Computational Biology (NIPS-MLCB), 2012. 

X. Lou, M. Kloft, G. Rätsch, and F. A. Hamprecht. Structured Learning from Cheap Data, 
chapter 12, page 281ff. MIT Press, 2014. 

G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace 
selection for multiple classification problems. Statistics and Computing, 20(2):231-252, 
2010. 


R. M. Rifkin and R. A. Lippert. Value Regularization and Fenchel Duality. Journal of 
Machine Learning Research, 8:441-479, 2007. ISSN 15324435. 

B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask 
learning. In Proceedings of The 30th International Conference on Machine Learning, 
pages 1444-1452, 2013. 

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of 
solutions and efficient algorithms. In Proceedings of the Twenty-Fifth International 
Conference on Machine Learning (ICML 2008), volume 307, pages 848-855. ACM, 2008. 

B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. R. Müller, G. Rätsch, and A. J. 
Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on 
Neural Networks, 10(5):1000-1017, 1999. 

G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An Empirical Analysis of Domain 
Adaptation Algorithms for Genomic Sequence Analysis. In D. Koller, D. Schuurmans, 
Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 
21, pages 1433-1440, 2008. 

S. Sonnenburg and V. Franc. Coffin: A computational framework for linear SVMs. In 
J. Fürnkranz and T. Joachims, editors, ICML, pages 999-1006. Omnipress, 2010. ISBN 
978-1-60558-907-7. 

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. 
Journal of Machine Learning Research, 7:1531-1565, July 2006. 

S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, 
A. Binder, C. Gehl, and V. Franc. The SHOGUN machine learning toolbox. The Journal 
of Machine Learning Research, 99:1799-1802, 2010. 

S. Thrun. Is learning the n-th thing any easier than learning the first? Advances in neural 
information processing systems, pages 640-646, 1996. 

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal 
Statistical Society Series B Methodological, 58(1):267-288, 1996. ISSN 00359246. doi: 
10.1111/j.1553-2712.2009.0451c.x. 

N. Toussaint, C. Widmer, O. Kohlbacher, and G. Rätsch. Exploiting physico-chemical 
properties in string kernels. BMC bioinformatics, 11 Suppl 8(Suppl 8):S7, Jan. 2010. 
ISSN 1471-2105. doi: 10.1186/1471-2105-11-S8-S7. URL 
http://www.biomedcentral.com/1471-2105/11/S8/S7. 

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable 
minimization. J. Optim. Theory Appl., 109(3):475-494, June 2001. ISSN 0022-3239. doi: 
10.1023/A:1017501703105. 

S. V. N. Vishwanathan, Z. Sun, N. Ampornpunt, and M. Varma. Multiple kernel learning 
and the SMO algorithm. In Advances in Neural Information Processing Systems 23, pages 
2361-2369, 2010. 

C. Widmer and G. Rätsch. Multitask Learning in Computational Biology. JMLR W&CP. 
ICML 2011 Unsupervised and Transfer Learning Workshop, 27:207-216, 2012. 


C. Widmer, J. Leiva, Y. Altun, and G. Rätsch. Leveraging Sequence Classification by 
Taxonomy-based Multitask Learning. In B. Berger, editor, Research in Computational 
Molecular Biology, pages 522-534. Springer, 2010a. 

C. Widmer, N. Toussaint, Y. Altun, O. Kohlbacher, and G. Rätsch. Novel machine learning 
methods for MHC Class I binding prediction. In Pattern Recognition in Bioinformatics, 
pages 98-109. Springer, 2010b. 

C. Widmer, N. Toussaint, Y. Altun, and G. Rätsch. Inferring latent task structure for 
Multitask Learning by Multiple Kernel Learning. BMC bioinformatics, 11 Suppl 8(Suppl 
8):S5, Jan. 2010c. ISSN 1471-2105. doi: 10.1186/1471-2105-11-S8-S5. 

C. Widmer, M. Kloft, N. Görnitz, and G. Rätsch. Efficient Training of Graph-Regularized 
Multitask SVMs. In ECML 2012, 2012. 

C. Widmer, M. Kloft, X. Lou, and G. Rätsch. Regularization-based Multitask Learning 
With applications to Genome Biology and Biomedical Imaging. Künstliche Intelligenz, 
2013a. 

C. Widmer, M. Kloft, and G. Rätsch. Multi-task learning for computational biology: 
Overview and outlook. In B. Schölkopf, Z. Luo, and V. Vovk, editors, Empirical 
Inference - Festschrift in Honor of Vladimir N. Vapnik, pages 117-127. Springer, 2013b. 

C. Widmer, S. Heinrich, P. Drewe, X. Lou, S. Umrania, and G. Rätsch. Graph-regularized 
3d shape reconstruction from highly anisotropic and noisy images. Signal Image Video 
Process, 8(1 Suppl):41-48, Dec 2014. doi: 10.1007/s11760-014-0694-8. 

Z. Xu, R. Jin, H. Yang, I. King, and M. Lyu. Simple and efficient multiple kernel learning by 
group lasso. In Proceedings of the 27th Conference on Machine Learning (ICML 2010), 
2010. 

Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task 
learning. arXiv preprint arXiv:1203.3536, 2010. 

J. Zhou, J. Chen, and J. Ye. Clustered Multi-Task Learning Via Alternating Structure 
Optimization. Advances in Neural Information Processing Systems 24, pages 1-9, 2011. 

